Difference between revisions of "QC Resources"

From Earth Science Information Partners (ESIP)

Revision as of 18:23, January 26, 2015

Publications about Quality Control

(abstract) The National Data Buoy Center (NDBC) has an extensive program to reduce greatly the chances of transmitting degraded measurements. These measurements are taken from its network of buoys and Coastal-Marine Automated Network (C-MAN) stations (Meindl and Hamilton, 1992). This paper discusses improvements to the real-time software that automatically validates the observations. In effect, these improvements codify rules data analysts—who have years of experience with our system—use to detect degraded data.
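
As a rough illustration of the kind of automated checks such real-time validation software codifies, the sketch below applies a simple range test and a rate-of-change test to a series of reports. The function names and thresholds are placeholders for illustration, not NDBC's operational rules.

<pre>
# Illustrative sketch only: simple automated range and rate-of-change checks of the
# kind real-time validation software codifies. Thresholds are placeholders, not
# NDBC's operational limits.

def passes_range_check(value, lower, upper):
    """True if the measurement lies within physically plausible limits."""
    return lower <= value <= upper

def passes_rate_check(previous, current, max_step):
    """True if the change between consecutive reports is plausibly small."""
    return abs(current - previous) <= max_step

# Example: hourly air temperature reports (degrees C), placeholder thresholds
reports = [12.1, 12.3, 12.2, 25.0, 12.4]
for prev, curr in zip(reports, reports[1:]):
    ok = passes_range_check(curr, -40.0, 50.0) and passes_rate_check(prev, curr, 5.0)
    if not ok:
        print(f"flag for manual review: {curr}")
</pre>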

(abstract) A recent comprehensive effort to digitize U.S. daily temperature and precipitation data observed prior to 1948 has resulted in a major enhancement in the computer database of the records of the National Weather Service’s cooperative observer network. Previous digitization efforts had been selective, concentrating on state or regional areas. Special quality control procedures were applied to these data to enhance their value for climatological analysis. The procedures involved a two-step process. In the first step, each individual temperature and precipitation data value was evaluated against a set of objective screening criteria to flag outliers. These criteria included extreme limits and spatial comparisons with nearby stations. The following data were automatically flagged: 1) all precipitation values exceeding 254 mm (10 in.) and 2) all temperature values whose anomaly from the monthly mean for that station exceeded five standard deviations. Additional values were flagged based on differences with nearby stations; in this case, metrics were used to rank outliers so that the limited resources were concentrated on those values most likely to be invalid. In the second step, each outlier was manually assessed by climatologists and assigned one of the four following flags: valid, plausible, questionable, or invalid. In excess of 22 400 values were manually assessed, of which about 48% were judged to be invalid. Although additional manual assessment of outliers might further improve the quality of the database, the procedures applied in this study appear to have been successful in identifying the most flagrant errors.
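
The two automatic screening criteria are simple enough to express directly. The sketch below is a minimal rendering of them, assuming daily values held in NumPy arrays; the function names and array layout are illustrative and not taken from the study's software.

<pre>
# Sketch of the two automatic screening criteria described in the abstract:
# flag precipitation above 254 mm (10 in.) and temperature values whose anomaly
# from the station's monthly mean exceeds five standard deviations. Here the
# monthly means and standard deviations are computed from the supplied values
# themselves, which is a simplification for illustration.
import numpy as np

PRECIP_LIMIT_MM = 254.0
ANOMALY_SIGMA = 5.0

def flag_precip(precip_mm):
    """Boolean mask of daily precipitation totals exceeding the extreme limit."""
    return np.asarray(precip_mm, dtype=float) > PRECIP_LIMIT_MM

def flag_temperature(daily_temps, month_index):
    """Flag daily temperatures more than five standard deviations from the
    station's monthly mean; month_index gives the month (1-12) of each value."""
    daily_temps = np.asarray(daily_temps, dtype=float)
    month_index = np.asarray(month_index)
    flags = np.zeros(daily_temps.shape, dtype=bool)
    for m in np.unique(month_index):
        sel = month_index == m
        mean, std = daily_temps[sel].mean(), daily_temps[sel].std()
        if std > 0:
            flags[sel] = np.abs(daily_temps[sel] - mean) > ANOMALY_SIGMA * std
    return flags
</pre>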

(abstract) Estimating the mean and the covariance matrix of an incomplete dataset and filling in missing values with imputed values is generally a nonlinear problem, which must be solved iteratively. The expectation maximization (EM) algorithm for Gaussian data, an iterative method both for the estimation of mean values and covariance matrices from incomplete datasets and for the imputation of missing values, is taken as the point of departure for the development of a regularized EM algorithm. In contrast to the conventional EM algorithm, the regularized EM algorithm is applicable to sets of climate data, in which the number of variables typically exceeds the sample size. The regularized EM algorithm is based on iterated analyses of linear regressions of variables with missing values on variables with available values, with regression coefficients estimated by ridge regression, a regularized regression method in which a continuous regularization parameter controls the filtering of the noise in the data. The regularization parameter is determined by generalized cross-validation, such as to minimize, approximately, the expected mean-squared error of the imputed values. The regularized EM algorithm can estimate, and exploit for the imputation of missing values, both synchronic and diachronic covariance matrices, which may contain information on spatial covariability, stationary temporal covariability, or cyclostationary temporal covariability. A test of the regularized EM algorithm with simulated surface temperature data demonstrates that the algorithm is applicable to typical sets of climate data and that it leads to more accurate estimates of the missing values than a conventional noniterative imputation technique.
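
For readers who want a concrete picture of the iteration, the following is a minimal sketch of ridge-regularized imputation in the spirit of the regularized EM algorithm described above. It is not the full method: the regularization parameter is held fixed here rather than chosen by generalized cross-validation, and the covariance update omits the correction for imputation error. Function and variable names are assumptions for illustration.

<pre>
# Minimal sketch of ridge-regularized iterative imputation, in the spirit of the
# regularized EM algorithm. NOT the full RegEM method: the ridge parameter is
# fixed (not chosen by generalized cross-validation) and the covariance estimate
# ignores the uncertainty of the imputed values.
import numpy as np

def ridge_impute(X, ridge=0.1, n_iter=20):
    """X: 2-D array (records x variables) with np.nan marking missing values."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # initialize missing entries with column means of the available data
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        C = np.cov(X, rowvar=False)
        for i in range(X.shape[0]):
            m = missing[i]
            if not m.any() or m.all():
                continue
            a = ~m  # variables available in this record
            # ridge-regularized regression of missing on available variables
            Caa = C[np.ix_(a, a)] + ridge * np.eye(int(a.sum()))
            Cam = C[np.ix_(a, m)]
            B = np.linalg.solve(Caa, Cam)
            X[i, m] = mu[m] + (X[i, a] - mu[a]) @ B
    return X
</pre>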

* [[Media: You 2007 - Relationship of flagging frequency to confidence intervals.pdf| Relationship of flagging frequency to confidence intervals, You, 2007]]

(abstract) With the widespread use of electronic interfaces in data collection, many networks have increased, or will increase, the sampling rate and add more sensors. The associated increase in data volume will naturally lead to an increased reliance on automatic quality assurance (QA) procedures. The number of data entries flagged for further manual validation can be affected by the choice of confidence intervals in statistically based QA procedures, which in turn affects the number of bad entries classified as good measurements. At any given station, a number of confidence intervals for the Spatial Regression Test (SRT) were specified and tested in this study, using historical data for both the daily minimum (Tmin) and maximum (Tmax), to determine how the frequency of flagging is related to the choice of confidence interval. An assessment of the general relationship of the number of data flagged to the specified confidence interval over a set of widely dispersed stations in the High Plains was undertaken to determine whether a single confidence factor would suffice, at all stations, to identify a moderate number of flags. This study suggests that using a confidence factor ‘f’ larger than 2.5 to specify the confidence interval will flag a reasonable number of measurements (<1%) for further manual validation and a single confidence factor can be applied for a state. This paper initially compares two formulations of the SRT method. This comparison is followed by an analysis of the percentage of observations flagged as a function of confidence interval.
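
The flagging decision the paper studies can be pictured with a simplified spatial-regression check: estimate the target station's value from regressions against neighboring stations and flag observations that fall outside the estimate plus or minus f times a combined standard error. The sketch below illustrates that idea with a simplified weighting scheme; it is not the exact SRT formulation compared in the paper, and all names and defaults are assumptions.

<pre>
# Simplified sketch of a spatial-regression-style check: regress the target
# station on each neighbor, combine the per-neighbor estimates weighted by the
# inverse square of their standard errors, and flag days where the observation
# lies outside estimate +/- f * s. An illustration only, not the exact SRT.
import numpy as np

def srt_flags(target, neighbors, f=3.0):
    """target: 1-D array of daily values at the station being checked.
    neighbors: 2-D array (n_neighbors x n_days) of values at nearby stations.
    Returns a boolean mask of flagged days."""
    target = np.asarray(target, dtype=float)
    estimates, errors = [], []
    for y in np.asarray(neighbors, dtype=float):
        slope, intercept = np.polyfit(y, target, 1)    # regress target on neighbor
        pred = intercept + slope * y
        s = np.sqrt(np.mean((target - pred) ** 2))     # standard error of estimate
        estimates.append(pred)
        errors.append(s)
    estimates, errors = np.array(estimates), np.array(errors)
    w = 1.0 / np.maximum(errors, 1e-12) ** 2
    x_est = (w[:, None] * estimates).sum(axis=0) / w.sum()  # weighted estimate
    s_est = (w * errors).sum() / w.sum()                    # weighted average error
    return np.abs(target - x_est) > f * s_est
</pre>

With this kind of check, increasing f widens the acceptance interval and reduces the fraction of observations flagged, which is the relationship between confidence factor and flagging frequency examined in the paper.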