# QC Resources

### Publications about Quality Control

This document contains a growing list of resources (peer-reviewed publications) about quality control that we find useful. Publications about sensors, equipment, or networks belong on the Sensor Resources page; please upload them there.

Sensor networks are revolutionizing environmental monitoring by producing massive quantities of data that are being made publicly available in near real time. These data streams pose a challenge for ecologists because traditional approaches to quality assurance and quality control are no longer practical when confronted with the size of these data sets and the demands of real-time processing. Automated methods for rapidly identifying and (ideally) correcting problematic data are essential. However, advances in sensor hardware have outpaced those in software, creating a need for tools to implement automated quality assurance and quality control procedures, produce graphical and statistical summaries for review, and track the provenance of the data. Use of automated tools would enhance data integrity and reliability and would reduce delays in releasing data products. Development of community-wide standards for quality assurance and quality control would instill confidence in sensor data and would improve interoperability across environmental sensor networks.

Missing data represent a general problem in many scientific fields above all in environmental research. Several methods have been proposed in literature for handling missing data and the choice of an appropriate method depends, among others, on the missing data pattern and on the missing-data mechanism. One approach to the problem is to impute them to yield a complete data set. The goal of this paper is to propose a new single imputation method and to compare its performance to other single and multiple imputation methods known in literature. Considering a data set of PM10 concentration measured every 2 h by eight monitoring stations distributed over the metropolitan area of Palermo, Sicily, during 2003, simulated incomplete data have been generated, and the performance of the imputation methods have been compared on the correlation coefficient (r), the index of agreement (d), the root mean square deviation (RMSD) and the mean absolute deviation (MAD). All the performance indicators agree to evaluate the proposed method as the best among the ones compared, independently on the gap length and on the number of stations with missing data.
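The four performance indicators named in this abstract can be computed as in the sketch below. It is not the paper's code: the function name `imputation_scores` is hypothetical, and the index of agreement is assumed to follow Willmott's usual definition, which the abstract does not spell out.

```python
import math

def imputation_scores(observed, imputed):
    """Compare imputed values with the true values that were withheld.

    Returns r, d, RMSD and MAD. The formula used for d is Willmott's
    index of agreement (an assumption; the abstract gives no formula).
    """
    n = len(observed)
    if n != len(imputed) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_o = sum(observed) / n
    mean_p = sum(imputed) / n
    pairs = list(zip(observed, imputed))

    # Pearson correlation coefficient r (undefined for constant input)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in pairs)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in imputed)
    if var_o == 0 or var_p == 0:
        raise ValueError("r is undefined when either sequence is constant")
    r = cov / math.sqrt(var_o * var_p)

    # Index of agreement d: 1 minus squared error over "potential" error
    sq_err = sum((p - o) ** 2 for o, p in pairs)
    potential = sum((abs(p - mean_o) + abs(o - mean_o)) ** 2 for o, p in pairs)
    d = 1.0 - sq_err / potential

    rmsd = math.sqrt(sq_err / n)                 # root mean square deviation
    mad = sum(abs(p - o) for o, p in pairs) / n  # mean absolute deviation
    return {"r": r, "d": d, "rmsd": rmsd, "mad": mad}
```

Higher r and d and lower RMSD and MAD indicate that the imputed values track the withheld observations more closely.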

(abstract) The National Data Buoy Center (NDBC) has an extensive program to reduce greatly the chances of transmitting degraded measurements. These measurements are taken from its network of buoys and Coastal-Marine Automated Network (C-MAN) stations (Meindl and Hamilton, 1992). This paper discusses improvements to the real-time software that automatically validates the observations. In effect, these improvements codify rules data analysts—who have years of experience with our system—use to detect degraded data.

(abstract) A recent comprehensive effort to digitize U.S. daily temperature and precipitation data observed prior to 1948 has resulted in a major enhancement in the computer database of the records of the National Weather Service’s cooperative observer network. Previous digitization efforts had been selective, concentrating on state or regional areas. Special quality control procedures were applied to these data to enhance their value for climatological analysis. The procedures involved a two-step process. In the first step, each individual temperature and precipitation data value was evaluated against a set of objective screening criteria to flag outliers. These criteria included extreme limits and spatial comparisons with nearby stations. The following data were automatically flagged: 1) all precipitation values exceeding 254 mm (10 in.) and 2) all temperature values whose anomaly from the monthly mean for that station exceeded five standard deviations. Additional values were flagged based on differences with nearby stations; in this case, metrics were used to rank outliers so that the limited resources were concentrated on those values most likely to be invalid. In the second step, each outlier was manually assessed by climatologists and assigned one of the four following flags: valid, plausible, questionable, or invalid. In excess of 22 400 values were manually assessed, of which about 48% were judged to be invalid. Although additional manual assessment of outliers might further improve the quality of the database, the procedures applied in this study appear to have been successful in identifying the most flagrant errors.
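The two automatic screening rules in the first step translate directly into code. The sketch below uses the thresholds quoted in the abstract; the function names are hypothetical, and the station's monthly mean and standard deviation are taken as inputs because the abstract does not say how they were computed.

```python
PRECIP_LIMIT_MM = 254.0  # 10 in., threshold quoted in the abstract
ANOMALY_LIMIT_SD = 5.0   # five standard deviations, quoted in the abstract

def flag_precipitation(values_mm):
    """Return True for each daily precipitation value above the extreme limit."""
    return [v > PRECIP_LIMIT_MM for v in values_mm]

def flag_temperature(values, monthly_mean, monthly_sd):
    """Return True for each value whose anomaly from the station's monthly
    mean exceeds five standard deviations.

    monthly_mean and monthly_sd are supplied by the caller (assumption:
    they come from that station's record for the same calendar month).
    """
    limit = ANOMALY_LIMIT_SD * monthly_sd
    return [abs(v - monthly_mean) > limit for v in values]
```

A flag here only marks a value as an outlier for the second, manual step; it does not by itself mean the value is invalid.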

(abstract) Estimating the mean and the covariance matrix of an incomplete dataset and filling in missing values with imputed values is generally a nonlinear problem, which must be solved iteratively. The expectation maximization (EM) algorithm for Gaussian data, an iterative method both for the estimation of mean values and covariance matrices from incomplete datasets and for the imputation of missing values, is taken as the point of departure for the development of a regularized EM algorithm. In contrast to the conventional EM algorithm, the regularized EM algorithm is applicable to sets of climate data, in which the number of variables typically exceeds the sample size. The regularized EM algorithm is based on iterated analyses of linear regressions of variables with missing values on variables with available values, with regression coefficients estimated by ridge regression, a regularized regression method in which a continuous regularization parameter controls the filtering of the noise in the data. The regularization parameter is determined by generalized cross-validation, such as to minimize, approximately, the expected mean-squared error of the imputed values. The regularized EM algorithm can estimate, and exploit for the imputation of missing values, both synchronic and diachronic covariance matrices, which may contain information on spatial covariability, stationary temporal covariability, or cyclostationary temporal covariability. A test of the regularized EM algorithm with simulated surface temperature data demonstrates that the algorithm is applicable to typical sets of climate data and that it leads to more accurate estimates of the missing values than a conventional noniterative imputation technique.
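The regression step at the heart of the algorithm can be written in the generic ridge form below. This is a sketch based on the abstract's description, not a transcription of the paper: the subscripts $a$ (available) and $m$ (missing), the regularization parameter $h$, and the scaling matrix $\hat{D}$ (assumed here to be the diagonal of $\hat{\Sigma}_{aa}$) are notational assumptions.

$$
\hat{x}_m = \hat{\mu}_m + (x_a - \hat{\mu}_a)\,\hat{B},
\qquad
\hat{B} = \left(\hat{\Sigma}_{aa} + h^2 \hat{D}\right)^{-1} \hat{\Sigma}_{am}
$$

For each record, the missing values $\hat{x}_m$ are imputed from the available values $x_a$ using the current estimates of the mean $\hat{\mu}$ and covariance $\hat{\Sigma}$, which are then re-estimated from the completed data, and the cycle repeats until the estimates stop changing. Setting $h = 0$ gives the unregularized regression of the conventional EM algorithm, which breaks down when $\hat{\Sigma}_{aa}$ is singular, as it is when the number of variables exceeds the sample size.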

Systematic planning is used to develop programs and to link program goals with cost, schedule, and quality criteria for the collection, evaluation, or use of data. Under the U.S. Environmental Protection Agency’s (EPA’s) quality system, the data quality objective (DQO) process was developed to assist systematic planning (U.S. EPA, 2000; Batterman et al., 1999). While not mandatory, this process is EPA’s recommended planning approach for many environmental data collection activities. It is based on the assumption that the ultimate goal for these activities is to make some decision (e.g., a regulatory compliance determination). This process uses a statistical approach to establish DQOs, which are qualitative or quantitative statements that clarify project objectives, that define the appropriate type of data, and that specify tolerable error levels for the decisions. The process finally develops a quality assurance (QA) project plan, including measurement quality objectives (MQOs), to collect data with uncertainties within these tolerable error levels.

(abstract) Collecting natural data at regular, fine scales is an onerous and often costly procedure. However, there is a basic need for fine scale data when applying inductive methods such as neural networks or genetic algorithms for the development of ecological models. This paper will address the issues involved in interpolating data for use in machine learning methods by considering how to determine if a downscaling of the data is valid. The approach is based on a multi-scale estimate of errors. The resulting function has similar properties to a time series variogram; however, the comparison at different scales is based on the variance introduced by rescaling from the original sequence. This approach has a number of properties, including the ability to detect frequencies in the data below the current sampling rate, an estimate of the probable average error introduced when a sampled variable is downscaled and a method for visualising the sequences of a time series that are most susceptible to error due to sampling. The described approach is ideal for supporting the ongoing sampling of ecological data and as a tool for assessing the impact of using interpolated data for building inductive models of ecological response.
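One simple way to get a feel for a multi-scale error estimate is to thin a finely sampled series, interpolate it back, and measure the error at each thinning step. The sketch below does that with linear interpolation. It is an illustration of the general idea only, not the paper's function, and the name `rescaling_error` is hypothetical.

```python
def rescaling_error(series, step):
    """Mean squared error introduced by keeping every `step`-th sample
    and linearly interpolating the samples in between.

    Illustrative assumption: linear interpolation stands in for whatever
    rescaling the paper actually uses.
    """
    n = len(series)
    if step < 1 or n < 2:
        raise ValueError("need step >= 1 and at least 2 samples")
    kept = list(range(0, n, step))
    if kept[-1] != n - 1:
        kept.append(n - 1)  # keep the last sample so every point is bracketed
    total = 0.0
    for left, right in zip(kept, kept[1:]):
        for i in range(left + 1, right):  # only the dropped samples can differ
            frac = (i - left) / (right - left)
            interpolated = series[left] + frac * (series[right] - series[left])
            total += (series[i] - interpolated) ** 2
    return total / n

def error_by_scale(series, steps):
    """Error profile across several thinning steps, one value per step."""
    return {step: rescaling_error(series, step) for step in steps}
```

Plotting the output of `error_by_scale` against the step gives a curve that can be read much like a variogram: a sharp rise at some step suggests variation at that scale which coarser sampling would miss.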

(abstract) Quality assurance (QA) procedures have been automated to reduce the time and labor necessary to discover outliers in weather data. Measurements from neighboring stations are used in this study in a spatial regression test to provide preliminary estimates of the measured data points. The new method does not assign the largest weight to the nearest estimate but, instead, assigns the weights according to the standard error of estimate. In this paper, the spatial test was employed to study patterns in flagged data in the following extreme events: the 1993 Midwest floods, the 2002 drought, Hurricane Andrew (1992), and a series of cold fronts during October 1990. The location of flagged records and the influence zones for such events relative to QA were compared. The behavior of the spatial test in these events provides important information on the probability of making a type I error in the assignment of the quality control flag. Simple pattern recognition tools that identify zones wherein frequent flagging occurs are illustrated. These tools serve as a means of resetting QA flags to minimize the number of type I errors as demonstrated for the extreme events included here.
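The weighting idea can be sketched as follows. Each neighboring station supplies an estimate of the target value (for example, from a linear regression fitted on historical data) together with the standard error of that regression, and the estimates are combined with inverse-variance weights. This is one common formulation and is assumed here; the paper's exact equations may differ, and the function names and the confidence factor `f` as a parameter are illustrative.

```python
import math

def srt_estimate(estimates, std_errors):
    """Combine neighbor-based estimates, weighting each by 1 / s_i**2 so
    that the best-fitting neighbor, not the nearest one, dominates.

    Returns the weighted estimate and a pooled standard error.
    """
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("need one positive standard error per estimate")
    weights = [1.0 / (s * s) for s in std_errors]
    weight_sum = sum(weights)
    x_hat = sum(w * x for w, x in zip(weights, estimates)) / weight_sum
    s_hat = math.sqrt(len(weights) / weight_sum)
    return x_hat, s_hat

def srt_passes(observed, estimates, std_errors, f):
    """True if the observation lies within x_hat +/- f * s_hat."""
    x_hat, s_hat = srt_estimate(estimates, std_errors)
    return abs(observed - x_hat) <= f * s_hat
```

During an extreme event, a correct observation can fall outside this interval simply because the neighbors did not experience the same conditions, which is the type I error the abstract discusses.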

(abstract) With the widespread use of electronic interfaces in data collection, many networks have increased, or will increase, the sampling rate and add more sensors. The associated increase in data volume will naturally lead to an increased reliance on automatic quality assurance (QA) procedures. The number of data entries flagged for further manual validation can be affected by the choice of confidence intervals in statistically based QA procedures, which in turn affects the number of bad entries classified as good measurements. At any given station, a number of confidence intervals for the Spatial Regression Test (SRT) were specified and tested in this study, using historical data for both the daily minimum (Tmin) and maximum (Tmax), to determine how the frequency of flagging is related to the choice of confidence interval. An assessment of the general relationship of the number of data flagged to the specified confidence interval over a set of widely dispersed stations in the High Plains was undertaken to determine whether a single confidence factor would suffice, at all stations, to identify a moderate number of flags. This study suggests that using a confidence factor ‘f’ larger than 2.5 to specify the confidence interval will flag a reasonable number of measurements (<1%) for further manual validation and a single confidence factor can be applied for a state. This paper initially compares two formulations of the SRT method. This comparison is followed by an analysis of the percentage of observations flagged as a function of confidence interval.
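The relationship between the confidence factor and the flagging rate can be explored with a few lines of code. The sketch below assumes each historical observation has already been reduced to a standardized departure from its spatial estimate, (observed - estimate) / standard error. The function names are hypothetical and the input is whatever historical record is at hand.

```python
def flag_fraction(standardized_departures, f):
    """Fraction of observations that fall outside +/- f standard errors."""
    if not standardized_departures:
        raise ValueError("need at least one observation")
    flagged = sum(1 for z in standardized_departures if abs(z) > f)
    return flagged / len(standardized_departures)

def flag_fraction_by_factor(standardized_departures, factors):
    """Flagging rate for each candidate confidence factor."""
    return {f: flag_fraction(standardized_departures, f) for f in factors}
```

Running `flag_fraction_by_factor` over a station's history for a range of factors shows how quickly the manual validation workload shrinks as `f` grows, which is the trade-off the study examines.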