# Chapman Conference

# Errors and Uncertainties in Observing a Changing Earth: Informatics and Statistical Issues

## Scientific Challenges

Providing a reliable assessment of the uncertainty in the data stored in Earth science repositories is a daunting task. However, such assessments are needed by numerous user communities, including those engaged in protecting public health and safety, in providing insurance and risk assessments, and those involved in both public and private planning for climate mitigation and adaptation. The proposed Conference is intended to explore how Earth science informatics and statistics might assist in providing useful assessments to various user communities through such mechanisms as interactive environments or “serious games”. It also intends to create easy-to-use models for data providers to express the uncertainty of their data and for users (including policy makers) to understand the uncertainty.

In assessing uncertainties, there are serious difficulties with organizing and presenting errors and uncertainties for user communities. One element of this difficulty lies in the complexity of the error sources described below.

## Conference Logistics

### Possible locations

- mountains near Asheville (proximity to NCDC)
- foothills near Boulder (proximity to NCAR)
- desert near Los Angeles (proximity to JPL)

### Convenors

- Bruce Barkstrom
- Peter Fox, RPI
- Rob Raskin, JPL

### Potential Program Committee

- Mark Berliner, Dept of Statistics, Ohio State U. (confirmed)
- Amy Braverman, JPL (confirmed)
- Doug Nychka, Geophysical Statistics Project, NCAR
- Karen Kafadar, Dept of Statistics, Indiana U. (confirmed)
- Rick Rood (Peter)
- R. L. Smith, UNC (Bruce)
- Caspar Ammann (Peter)
- Michael McIntyre (Peter)
- Bryan Baum (Bruce)
- Claudia Stubenrauch, ISCCP (Bruce)

- Paleo (Peter)
- Error budget (Amy)

### Potential Co-Sponsors

- DOE
- ESIP Federation (confirmed)
- NASA (through Martha Maiden)
- NSF (easy if extension to existing grant from same Directorate)
- NOAA
- ESI (logo)
- AMS (logo)
- ESSI (logo)
- IUGG (logo)
- NIEHS

### Communities

- Climate (incl. Australia)
- Clouds?
- Water Management (incl. Australia)
- Health/epidemiology
- other GEOSS

### Ombudsmen

- Jim Frew
- Charlie Barton

### Format

- Mixture of plenary and breakouts by varying criteria

### Time frame

- Late 2010 or early 2011

## Scientific Program

### Goals

- Create a framework for presenting and judging errors and uncertainties in the context of a changing Earth
- Raising awareness of intellectual disciplines dealing with errors and uncertainties
  - training educators to better educate students
  - senior outreach

- New techniques/best practices [def of new discipline]
- Bridge the gap between statisticians and Earth system scientists
- Reconcile various community approaches to uncertainty (kriging, data assimilation)
- Common measures of data quality
- Tools for improved visualization and estimation of uncertainty
- Models for representing uncertainty that can be applied by non-statisticians
- Community-wide estimation of covariance functions
- Better provenance representations of uncertainty

### Tangible outcomes

- Edited AGU monograph (detailing the state of the practice)
- Links to data/numerical experiments from eBook text
- Statement of direction
- Community of practice
- Framework for a new discipline [application of informatics to presenting mathematical and statistical methods]

### Science Disciplinary Scope

- Climate, water resources, air quality

### Use Cases

- IPCC Assessment
- Instrument uncertainty model
- Water resource planning
- Non-Gaussian Point Spread Functions

### Sources of Error

#### Positional and Coverage Measures

- Horizontal, vertical, and time resolution
- Positional certainty
- Cloud cover, aerosols, trace gases, and other measures of surface visibility
- Geographic coverage (How does this differ from or relate to horizontal resolution?)

To address these aspects rigorously, it is critical to know the Point Spread Function (PSF) or, equivalently for resolution purposes, the Modulation Transfer Function (the magnitude of its Fourier transform), together with the variability in the underlying geophysical field and other forms of uncertainty introduced by the retrieval process. Interpolating between points with the PSF taken into account calls for inverting a Fredholm integral equation of the first kind, which is usually very expensive. The typical workarounds of nearest neighbor or cubic spline interpolation are not theoretically justified. Capturing horizontal resolution correctly requires knowledge of both the noise in the measurement and the PSF.

There is a coherent statistical framework for relating measurements (either point measurements or pixel type measurements) to the underlying continuous physical field about which one would like to make inferences. This framework comes with a mechanism for quantifying and propagating uncertainties, and goes by the name “spatial statistics”.

Cloud cover (as a measure of surface visibility) depends on the size of the pixels (and their overlap) and has a different meaning than in its climate energetics sense. An example could be provided by selecting ASTER images that lie inside MODIS images and constructing a scatter plot of "cloud cover" for pairs of images that look at the same regions. In CERES, we used the latitudinal "cloud cover" as a quality control metric, since we had global coverage in the Level 2 data products and that metric was stable from day to day to within approximately one percent.
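The dependence of "cloud cover" on pixel size can be illustrated with a toy calculation. The sketch below is only an illustration under stated assumptions: the one-dimensional scene, the function name `cloud_cover`, and the threshold rule for flagging a coarse pixel as cloudy are all hypothetical and are not taken from any ASTER, MODIS, or CERES algorithm.

```python
def cloud_cover(mask, block, threshold):
    """Fraction of coarse pixels flagged as cloudy.

    mask      : list of 0/1 values on a fine grid (1 = cloudy cell)
    block     : number of fine cells aggregated into one coarse pixel
    threshold : a coarse pixel is flagged cloudy when its cloudy
                fraction is >= threshold (hypothetical decision rule)
    """
    if len(mask) % block != 0:
        raise ValueError("mask length must be a multiple of block")
    flags = []
    for start in range(0, len(mask), block):
        cells = mask[start:start + block]
        fraction = sum(cells) / block
        flags.append(1 if fraction >= threshold else 0)
    return sum(flags) / len(flags)


# Hypothetical fine-resolution transect: 4 cloudy cells out of 16.
scene = [1, 0, 0, 0,  0, 0, 0, 0,  1, 1, 0, 0,  0, 0, 0, 1]

fine = cloud_cover(scene, block=1, threshold=1.0)          # native resolution
coarse_any = cloud_cover(scene, block=4, threshold=0.25)   # "any cloud" rule
coarse_half = cloud_cover(scene, block=4, threshold=0.5)   # "half cloudy" rule

print(fine, coarse_any, coarse_half)
```

The same scene yields different "cloud cover" values depending on the pixel size and the flagging rule, which is one reason a taxonomy of data quality terms needs to record both.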

Informatics Challenges

- Provide examples of the impact of PSF and noise tradeoffs on horizontal resolution and positional uncertainty that are useful for GIS applications
- Provide reliable taxonomies for semantic distinctions between various data quality terms, such as the meaning of “cloud cover”
- Provide means of preserving uncertainty estimates for positional and coverage measures over the long term, allowing for the certainty that both understanding and data will evolve

Statistics Challenges

- Formulate simple spatial statistical models that can unify the representation of relationships between known and unknown quantities and allow for rigorous inference and quantification of uncertainties.
- Connect these models in a useful way (not purely abstract) to the recording and presentation of data for use in science investigations. Example: the use of simulation to illustrate the consequences of uncertainty modeled probabilistically.
- Communicate these ideas to the science and user communities.

#### Radiometric Accuracy

- Calibration accuracy
- Spectral knowledge
- Noise characteristics
- Contamination and other sources of long-term drift and change

The solar constant gives a relatively pure example of the importance of radiometric accuracy. In astronomy, radiometric calibration is known to cause the longest delays in producing data products. For climate, scientific progress will likely need accuracies below one percent over a decade or more. Radiometric calibration uses very different techniques for different kinds of instrumentation. An additional informatics task involves ensuring preservation of calibration data, calibration plans and procedures, as well as the often tacit knowledge that instrument developers may apply in editing the basic calibration data. A commonly used approach for developing uncertainty estimates of calibration error is to use error budgets. These often rely on Gaussian statistics and simplistic assumptions regarding constant biases.
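A conventional error budget of the kind described above can be sketched in a few lines. This is a minimal sketch, assuming that the random terms are independent and Gaussian (so they combine in quadrature) and that biases are constant and combined as a worst-case sum of magnitudes. The term names and the percentage values are hypothetical placeholders, not values for any real instrument.

```python
import math


def combine_budget(random_terms, bias_terms):
    """Combine an error budget under the usual simplifying assumptions.

    random_terms : dict of name -> 1-sigma uncertainty (assumed independent, Gaussian)
    bias_terms   : dict of name -> constant bias (assumed not to vary in time)
    Returns (root-sum-square of the random terms, sum of bias magnitudes).
    """
    random_total = math.sqrt(sum(s * s for s in random_terms.values()))
    bias_total = sum(abs(b) for b in bias_terms.values())
    return random_total, bias_total


# Hypothetical budget, in percent of the measured radiance.
random_terms = {"detector noise": 0.3, "reference source": 0.4}
bias_terms = {"spectral response": 0.2, "contamination": -0.1}

random_total, bias_total = combine_budget(random_terms, bias_terms)
print(f"random (RSS): {random_total:.2f}%  bias (worst case): {bias_total:.2f}%")
```

Both assumptions are exactly the ones questioned in this section: contamination, for instance, is unlikely to behave as a constant bias over a decade.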

Informatics Challenges

- Develop a scholarly approach to radiometric error budgets with a common structure of error sources and calibration processes (insofar as these fit within the physics appropriate to a particular kind of instrumentation)
- Develop methods of documentation that may include interactive workflows that could be used to understand the impact of instrument provider assumptions
- Develop an approach to making the calibration algorithms accessible and traceable to data scholars, even in the presence of large code bases and large databases

Statistics Challenges

- Break out the different sources of error (radiometric uncertainty, calibration uncertainty, etc.) in spatial (or spatio-temporal) statistical models, modeling each contribution as a separate error term.
- Explore how nonparametric and robust statistical modeling can help us move away from purely Gaussian assumptions.

#### Instantaneous Geophysical Parameter Accuracy

- Radiative Transfer Accuracy
- Inversion Methodology Accuracy and Resolution

As Earth science data providers move beyond radiometrically calibrated data, they convert the instantaneous measurements into other instantaneous quantities. In many cases, this conversion involves using radiative transfer to relate the measured radiances to the instantaneous geophysical parameters of interest. It seems reasonable to divide the errors into two categories: those arising from radiative transfer modeling and those arising from the mathematics of the “inversion” process. There is a substantial difference between visible radiative transfer (with monochromatic multiple scattering which is sensitive to cloud and aerosol particle properties) and IR radiative transfer (which involves the complexities of molecular spectra).

In the inversion process, the mathematics of inversion is known to involve a tradeoff between accuracy and vertical resolution. In addition, the inversion treatment of vertical structure in the presence of clouds is non-trivial.
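The tradeoff between accuracy and vertical resolution can be seen in a deliberately simplified setting. The sketch below is a toy illustration, not a retrieval algorithm: it assumes that each layer value carries independent noise with the same standard deviation, and that coarser vertical resolution is obtained by averaging adjacent layers. The function name and parameters are hypothetical.

```python
import math


def resolution_noise_tradeoff(sigma, layer_thickness_km, widths):
    """Return (vertical resolution in km, noise standard deviation) pairs.

    sigma              : noise standard deviation of a single-layer value
    layer_thickness_km : thickness of one layer on the fine vertical grid
    widths             : numbers of adjacent layers averaged together
    Assumes independent, equal-variance noise in every layer.
    """
    rows = []
    for w in widths:
        resolution = w * layer_thickness_km   # coarser as more layers are averaged
        noise = sigma / math.sqrt(w)          # smaller as more layers are averaged
        rows.append((resolution, noise))
    return rows


for resolution, noise in resolution_noise_tradeoff(1.0, 1.0, [1, 4, 16]):
    print(f"resolution {resolution:5.1f} km   noise std {noise:.2f}")
```

Real inversions are more complicated because the noise in neighboring layers is correlated and the averaging weights depend on the radiative transfer, but the qualitative behavior is the same: reduced noise is bought with coarser vertical resolution.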

Informatics Challenges

- Develop a scholarly approach to assessing the uncertainties involved in inverting remotely sensed data to produce instantaneous geophysical parameters (Optimal estimation?)
- Develop methods of visualizing the four-dimensional error structure and of helping provide an understanding of the influence of various assumptions and algorithms
- Develop methods of documenting and preserving the evolution of knowledge regarding uncertainties in this area

Statistics Challenges

- Model the different sources of error in the retrieval process individually.

#### Time and Space Averaging Accuracy

- Systematic Time Variations
- Systematic Spatial Variations

Most data uses involve building and working with time and space averaged data. There are two very important statistical issues here. First, pixel measurements are themselves spatial averages and thus fit well within the spatial statistical framework. Spatial statistics came about in an effort to represent the relationships between uncertain spatial entities. However, producing such averages involves interpolating between data samples that are often quite sparse. Simply adding measurements and dividing by the number of samples does not remove systematic variations. Moreover, it does not account for spatial and temporal dependence, which can lead to serious errors. For example, there are notable differences between the temporal variations of fields over the ocean and over land. Emitted longwave radiation over the Sahara is quite directly correlated with the time variation of solar irradiance, whereas the same quantity may be nearly independent of solar irradiance over the ocean (at least for clear skies).
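The effect of temporal dependence on a simple average can be made concrete with the standard large-sample approximation for a series with lag-one autocorrelation. This is a minimal sketch, assuming a stationary first-order autoregressive process, which is itself a simplification; the function names are hypothetical.

```python
import math


def effective_sample_size(n, rho):
    """Approximate number of independent samples in n correlated ones.

    Uses the large-n approximation n * (1 - rho) / (1 + rho) for a
    stationary AR(1) series with lag-one autocorrelation rho.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return n * (1.0 - rho) / (1.0 + rho)


def standard_error_of_mean(sigma, n, rho=0.0):
    """Standard error of the sample mean, allowing for autocorrelation."""
    return sigma / math.sqrt(effective_sample_size(n, rho))


naive = standard_error_of_mean(1.0, 30)            # treats samples as independent
adjusted = standard_error_of_mean(1.0, 30, 0.5)    # accounts for dependence
print(f"naive: {naive:.3f}   adjusted: {adjusted:.3f}")
```

With a lag-one autocorrelation of 0.5, thirty daily samples carry roughly the information of ten independent ones, so the naive standard error understates the uncertainty of a monthly mean. Systematic variations, such as the diurnal cycle over the Sahara, are a separate problem that this adjustment does not address.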

Informatics Challenges

- Develop a scholarly approach to documenting assumptions and mathematical treatments of spatial and temporal variability
- Develop methods of preserving, accessing, and understanding the evolution of knowledge regarding errors and uncertainties caused by the treatment of time and space averaging

Statistics Challenges

- The main challenge here is simplifying and communicating spatial statistical principles to people who are not trained formally in statistics. The theory is there to handle the problems discussed above, but putting it in terms that are really useful will be hard.

#### Metrics in the face of Non-stationary Statistics

- Climate change induces long-term, non-linear trends
- No reason to expect trend changes are Gaussian

There are many fundamental uncertainties in the mathematical formulation of errors and uncertainties. Instrument providers often treat systematic errors as asymptotically constant biases, while they treat other error sources as Gaussian white noise with a constant standard deviation. These assumptions do not appear realistic, particularly when dealing with such problems as instrument contamination or instrument behavior under operational conditions. In addition, climate change appears likely to introduce phenomena, such as nonlinear trends or changes in extreme value statistics, that lie outside the bounds of classical analytic mathematics.
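A small simulation shows why the "constant bias plus white noise" picture can mislead. The sketch below assumes a hypothetical instrument whose error is the sum of Gaussian white noise and a slow linear drift (a stand-in for something like contamination); the drift rate and noise level are illustrative values only.

```python
import random


def mean_error(n, sigma, drift_per_step, seed=0):
    """Mean measurement error over n samples of a quantity whose true value is 0.

    Each sample error is a linear drift term plus Gaussian white noise.
    """
    rng = random.Random(seed)
    errors = [drift_per_step * i + rng.gauss(0.0, sigma) for i in range(n)]
    return sum(errors) / n


for n in (100, 1000, 10000):
    noise_only = mean_error(n, sigma=1.0, drift_per_step=0.0)
    with_drift = mean_error(n, sigma=1.0, drift_per_step=0.001)
    print(n, round(noise_only, 3), round(with_drift, 3))
```

Averaging more samples shrinks the white-noise contribution, but the drift contribution grows with the length of the record, so an uncertainty estimate that scales as one over the square root of the sample count would be increasingly optimistic.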

Informatics Challenges

- Develop mathematical and computational methods that lead to reliable and understandable treatments of uncertainties and errors. Many such techniques already exist, but the challenge is making them practical in this setting.
- Develop methods of visualizing and presenting such non-standard assessments to users
- Develop methods of preserving the evolution of understanding of these uncertainty estimates

Statistics Challenges

- Fit a multidimensional probability density function to dependent data
- Adapt computationally intensive methods (Markov chain Monte Carlo, for example) for routine use in computationally constrained environments.

### The Usability Challenge

For valuation purposes, data uses can be divided into:

- Catastrophe warning (tsunamis, tornadoes): 3 hour latency, major value: service continuity
- Weather planning: 2 week coverage, major value: service continuity
- Process understanding: 1-2 year observation
- Short term statistics: 5 year record (crop damage insurance underwriting)
- Mid-term statistics: 25 year record (power plant siting) - trends important
- Long-term statistics: 100 year record (100 yr flood estimate) - extreme value distribution
- Doomsday statistics: >100 year record - catastrophic climate change
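As an illustration of the long-term statistics category, the meaning of a "100 yr flood" can be expressed as a probability over a planning horizon. This is a minimal sketch, assuming that years are independent and that the annual exceedance probability stays fixed at one over the return period. Both assumptions become questionable when the climate is changing, which is part of the difficulty this conference is meant to address. The function name is hypothetical.

```python
def prob_at_least_one_exceedance(return_period_years, horizon_years):
    """Probability that a T-year event occurs at least once in a given horizon.

    Assumes independent years with a constant annual exceedance
    probability of 1 / return_period_years (a stationary climate).
    """
    if return_period_years <= 1:
        raise ValueError("return period must exceed one year")
    annual_p = 1.0 / return_period_years
    return 1.0 - (1.0 - annual_p) ** horizon_years


for horizon in (1, 30, 100):
    p = prob_at_least_one_exceedance(100, horizon)
    print(f"{horizon:3d}-year horizon: {p:.3f}")
```

Under these assumptions a "100 yr flood" is far from a once-in-a-lifetime certainty or impossibility: the chance of seeing at least one over a 30-year horizon is roughly one in four.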

Much of the work on uncertainties and errors has used simple analytic methods that are easily presented in static form. For example, results are often stated in terms of significance levels, even though these measures have received substantive criticism from well-informed scientists, mathematicians, and statisticians. Use of Bayesian approaches has been relatively muted.

An additional component of usability lies in the need to incorporate social science and economic impacts of data. Most uses of data on longer time scales involve developing quantitative and probabilistic estimates of the impact of various climate-related phenomena. Furthermore, data users typically need these quantifications on spatial scales that are small compared with those used in global climate modeling.

Informatics Challenges

- Develop modeling and visualization tools to present quantitative error and uncertainty assessments in a way that is useful to data end-users
- Develop methods of coupling uncertainty estimates with economic models in a way that allows end-users to assess the probability of significant impact on regional economies
- Develop environments, visualization tools, and “serious games” that would allow end-users to understand and develop mechanisms and policies that would be politically acceptable while mitigating the risk of untoward problems

Statistics Challenges

- Develop state-space models for propagating errors from data to climate predictions to decision support models.
- Run simulations using inputs drawn from probability distributions whose characteristics can be varied, to see both likely and unlikely impacts.
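The simulation approach in the last item can be sketched as a simple Monte Carlo propagation. This is only an illustration under stated assumptions: the Gaussian rainfall distribution, the threshold-type `impact` model, and all parameter values are hypothetical stand-ins for a real climate input and a real decision-support model.

```python
import random


def impact(rainfall_mm, capacity_mm, cost_per_mm):
    """Hypothetical impact model: loss accrues only above a capacity threshold."""
    return max(0.0, rainfall_mm - capacity_mm) * cost_per_mm


def simulate(n, rain_mean, rain_sd, capacity_mm=100.0, cost_per_mm=2.0, seed=0):
    """Propagate an uncertain input through the impact model by sampling.

    Draws n rainfall values from a Gaussian distribution and summarizes
    the resulting distribution of losses.
    """
    rng = random.Random(seed)
    losses = sorted(
        impact(rng.gauss(rain_mean, rain_sd), capacity_mm, cost_per_mm)
        for _ in range(n)
    )
    return {
        "p_loss": sum(1 for x in losses if x > 0.0) / n,  # chance of any loss
        "median": losses[n // 2],
        "p95": losses[int(0.95 * (n - 1))],               # an unlikely but possible loss
    }


# Vary the spread of the input to see how the tail of the impact responds.
for sd in (5.0, 10.0, 20.0):
    print(sd, simulate(2000, rain_mean=90.0, rain_sd=sd))
```

Because the impact model is nonlinear, widening the input uncertainty changes the probability and size of losses even when the mean input is unchanged, which is the kind of result an end-user needs to see alongside the central estimate.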