Preservation Use Case Choosing a dataset

From Earth Science Information Partners (ESIP)

Choosing a data set from multiple similar choices.

Summary

A research user needs to pick, from multiple similar data sets, the one that best meets the requirements of their intended application. For example, a polar bear ecologist might need to choose a data set on sea ice conditions in a region of Hudson Bay from the multiple data sets listed at NSIDC. Another user might need to choose which sea surface temperature data set from PO.DAAC to use in forcing a model of an algal bloom. Many other examples exist. Traditionally, the user has made this choice by consulting a relevant expert. Ideally, an expert system could guide the user through their query, provided the system had access to sufficient information.

Relevant experts may include:

  • science domain experts that know the science applications.
  • instrument experts that know the subtleties of the observation mechanism.
  • algorithm experts that know the variations in retrievals.
  • process experts that know the subtleties of the processing implementation.
  • data format experts that know how to handle, for example, HDF4 vs. HDF5.

How best to capture the "gotchas" potentially introduced at each step along the way?

Suitability of data usage

  • Mapping observations (e.g. variables) to appropriate science focus areas.

Actors

  • Research user
  • Data expert(s)/Expert system
    • science domain experts that know the science applications.
    • instrument experts that know the subtleties of the observation mechanism.
    • algorithm experts that know the variations in retrievals.
    • process experts that know the subtleties of the processing implementation.
    • data format experts that know how to handle, for example, HDF4 vs. HDF5.
  • Archive

Sequence of Events

  1. User poses initial request to expert
  2. Expert queries user on specifics
  3. Iteration between user and expert to understand vocabularies and actual needs
    1. Initial possible data sets are identified by basic criteria like whether the data set covers the right time and location
    2. The list is further refined by more qualitative criteria specific to the actual query
  4. A recommended data set or ranked list of data sets is returned to the user
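
The two-stage narrowing in the sequence above (basic coverage criteria first, then more qualitative criteria) can be sketched in code. This is only a minimal illustration of how an expert system might approach it, not a description of any existing ESIP, NSIDC, or PO.DAAC service. The `Dataset` and `Request` classes, their fields, the criterion names, and the sample catalog entries are all hypothetical, and the weighted-sum scoring is one assumed way to combine qualitative criteria.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple

# (west, south, east, north) in degrees; a hypothetical convention
BBox = Tuple[float, float, float, float]


@dataclass
class Dataset:
    """Hypothetical catalog entry; field names are illustrative only."""
    name: str
    start: date
    end: date
    bbox: BBox
    scores: Dict[str, float]  # expert-assigned suitability per criterion, 0..1


@dataclass
class Request:
    """Hypothetical user request after the user/expert iteration."""
    start: date
    end: date
    bbox: BBox
    weights: Dict[str, float]  # importance of each qualitative criterion


def covers(ds: Dataset, req: Request) -> bool:
    """Basic criterion: the data set spans the requested time and location."""
    in_time = ds.start <= req.start and ds.end >= req.end
    west, south, east, north = ds.bbox
    r_west, r_south, r_east, r_north = req.bbox
    in_space = (west <= r_west and south <= r_south
                and east >= r_east and north >= r_north)
    return in_time and in_space


def rank(datasets: List[Dataset], req: Request) -> List[Dataset]:
    """Filter by basic criteria, then order by weighted qualitative score."""
    candidates = [ds for ds in datasets if covers(ds, req)]

    def score(ds: Dataset) -> float:
        # Criteria with no expert score count as 0 (an assumption).
        return sum(w * ds.scores.get(c, 0.0) for c, w in req.weights.items())

    return sorted(candidates, key=score, reverse=True)


# Made-up catalog and request, for illustration only.
catalog = [
    Dataset("dataset_a", date(1979, 1, 1), date(2020, 12, 31),
            (-180.0, -90.0, 180.0, 90.0),
            {"resolution": 0.4, "record_consistency": 0.9}),
    Dataset("dataset_b", date(2002, 1, 1), date(2015, 12, 31),
            (-180.0, 30.0, 180.0, 90.0),
            {"resolution": 0.9, "record_consistency": 0.5}),
    Dataset("dataset_c", date(2012, 1, 1), date(2020, 12, 31),
            (-180.0, 30.0, 180.0, 90.0),
            {"resolution": 0.95, "record_consistency": 0.6}),
]
request = Request(date(2005, 1, 1), date(2010, 12, 31),
                  (-95.0, 51.0, -77.0, 64.0),
                  {"resolution": 0.7, "record_consistency": 0.3})

ranked_names = [ds.name for ds in rank(catalog, request)]
print(ranked_names)
```

In this made-up example, `dataset_c` drops out at the coverage step because its record starts after the requested period, and the remaining two are ordered by the weights the user gave. The hard part in practice is where the expert scores come from, which is what the artifacts listed in the next section are meant to support.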

PCCS Artifacts

  • Data usage information
    • Informal feedback from users (e.g., Amazon-style comments)
    • Publications about the data
    • Publications that use the data
  • Data “peer review” information. This is ill-defined, but could include:
    • Audit information about practices and processes to produce and maintain the data
    • Advice from scientific advisory groups, etc.
    • ….
  • Authority or certification information
    • Who is the authority (if there is one) that is asserting that the data meet certain quality criteria (e.g., the National Weather Service)?
    • Criteria used in the certification
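
One way to picture how an archive might hold these artifacts alongside a data set is as a small structured record. The sketch below is an assumption for illustration only: the class and field names are hypothetical, they cover only the artifact types explicitly named in the list above, and the `missing_artifacts` helper is an invented convenience for spotting gaps.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UsageInfo:
    """Data usage information (hypothetical field names)."""
    user_comments: List[str] = field(default_factory=list)
    publications_about: List[str] = field(default_factory=list)
    publications_using: List[str] = field(default_factory=list)


@dataclass
class PeerReviewInfo:
    """Data "peer review" information named in the list above."""
    audit_notes: List[str] = field(default_factory=list)
    advisory_group_advice: List[str] = field(default_factory=list)


@dataclass
class CertificationInfo:
    """Authority or certification information."""
    authority: Optional[str] = None  # None if no certifying authority exists
    criteria: List[str] = field(default_factory=list)


@dataclass
class DatasetArtifacts:
    usage: UsageInfo = field(default_factory=UsageInfo)
    peer_review: PeerReviewInfo = field(default_factory=PeerReviewInfo)
    certification: CertificationInfo = field(default_factory=CertificationInfo)


def missing_artifacts(artifacts: DatasetArtifacts) -> List[str]:
    """Return the artifact categories that are still empty for a data set."""
    missing = []
    usage = artifacts.usage
    if not (usage.user_comments or usage.publications_about
            or usage.publications_using):
        missing.append("usage")
    review = artifacts.peer_review
    if not (review.audit_notes or review.advisory_group_advice):
        missing.append("peer_review")
    if artifacts.certification.authority is None:
        missing.append("certification")
    return missing
```

A record like this would let an expert, or an expert system, see at a glance which kinds of supporting evidence exist for a candidate data set before weighing it against the alternatives.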
