Preservation Use Case Choosing a dataset
Choosing a data set from multiple similar choices.
Summary
A research user needs to pick, from multiple similar data sets, the one that best meets their requirements for an intended application. For example, a polar bear ecologist might choose a data set on sea ice conditions in a region of Hudson Bay from the multiple data sets listed at NSIDC. Another user might choose which sea surface temperature data set from PO.DAAC to use in forcing a model of an algal bloom. Many other examples exist. Traditionally, the user has done this by consulting a relevant expert. Ideally, an expert system could guide the user through the query, provided the system had access to sufficient information.
Relevant experts may include:
- science domain experts that know the science applications.
- instrument experts that know the subtleties of the observation mechanism.
- algorithm experts that know the variations in retrievals.
- process experts that know the subtleties of the processing implementation.
- data format experts that know the handling of, for example, HDF4 vs. HDF5.
An open question is how best to capture the "gotchas" potentially introduced at each step along the way.
Suitability of data usage
- Mapping observations (e.g. variables) to appropriate science focus areas.
Actors
- Research user
- Data expert(s)/Expert system
  - science domain experts that know the science applications.
  - instrument experts that know the subtleties of the observation mechanism.
  - algorithm experts that know the variations in retrievals.
  - process experts that know the subtleties of the processing implementation.
  - data format experts that know the handling of, for example, HDF4 vs. HDF5.
- Archive
Sequence of Events
- User poses initial request to expert
- Expert queries user on specifics
- Iteration between user and expert to understand vocabularies and actual needs
- Initial possible data sets are identified by basic criteria like whether the data set covers the right time and location
- The list is further refined by more qualitative criteria specific to the actual query
- A recommended data set or ranked list of data sets is returned to the user
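The last three steps above can be sketched as a two-stage selection: a hard filter on temporal and spatial coverage, followed by a weighted ranking on qualitative criteria. This is only an illustrative sketch, not a description of any existing system. The `Dataset` and `Request` records, the criterion names, and the numeric scores are all hypothetical, and the bounding-box test assumes regions that do not cross the antimeridian.

```python
from dataclasses import dataclass
from datetime import date

# (west, south, east, north) in degrees; assumed not to cross the antimeridian.
BBox = tuple[float, float, float, float]


@dataclass
class Dataset:
    name: str
    start: date
    end: date
    bbox: BBox
    scores: dict[str, float]  # hypothetical expert-assigned scores in [0, 1]


@dataclass
class Request:
    start: date
    end: date
    bbox: BBox
    weights: dict[str, float]  # how much the user cares about each criterion


def covers(ds: Dataset, req: Request) -> bool:
    """Basic criteria: the data set spans the requested period and region."""
    time_ok = ds.start <= req.start and ds.end >= req.end
    west, south, east, north = ds.bbox
    r_west, r_south, r_east, r_north = req.bbox
    space_ok = (west <= r_west and south <= r_south
                and east >= r_east and north >= r_north)
    return time_ok and space_ok


def recommend(datasets: list[Dataset], req: Request) -> list[Dataset]:
    """Filter by coverage, then rank the survivors by weighted score."""
    candidates = [ds for ds in datasets if covers(ds, req)]

    def score(ds: Dataset) -> float:
        # A criterion the data set has no score for contributes nothing.
        return sum(w * ds.scores.get(k, 0.0) for k, w in req.weights.items())

    return sorted(candidates, key=score, reverse=True)


# Made-up example: three candidate data sets and one user request.
request = Request(date(2010, 1, 1), date(2012, 12, 31),
                  (-95.0, 55.0, -80.0, 63.0),
                  {"resolution": 0.7, "validation": 0.3})
catalog = [
    Dataset("A", date(2000, 1, 1), date(2020, 12, 31),
            (-180.0, -90.0, 180.0, 90.0),
            {"resolution": 0.2, "validation": 0.9}),
    Dataset("B", date(2005, 1, 1), date(2015, 12, 31),
            (-100.0, 50.0, -70.0, 70.0),
            {"resolution": 0.9, "validation": 0.5}),
    Dataset("C", date(2011, 1, 1), date(2020, 12, 31),
            (-180.0, -90.0, 180.0, 90.0),
            {"resolution": 1.0, "validation": 1.0}),
]
ranked = [ds.name for ds in recommend(catalog, request)]
print(ranked)  # ['B', 'A']; C is dropped because it starts after the requested period
```

In practice the weights would come out of the iteration between user and expert, and the scores would have to be derived from the kinds of contextual information listed under PCCS Artifacts.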
PCCS Artifacts
- Data usage information
  - Informal feedback from users (e.g., Amazon-style comments)
  - Publications about the data
  - Publications that use the data
- Data “peer review” information. This is ill defined, but could include:
  - Audit information about the practices and processes used to produce and maintain the data
  - Advice from scientific advisory groups, etc.
  - ….
- Authority or certification information
  - Who is the authority (if there is one) that is asserting that the data meet certain quality criteria (e.g., the National Weather Service)
  - Criteria used in the certification
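As a rough illustration of how these artifacts might be held alongside a data set, the following sketch groups them into a single record. The class and field names are hypothetical and simply mirror the list above. No existing archive schema is implied.

```python
from dataclasses import dataclass, field


@dataclass
class Certification:
    authority: str       # who asserts that the data meet quality criteria
    criteria: list[str]  # criteria used in the certification


@dataclass
class DatasetContext:
    dataset_id: str
    user_feedback: list[str] = field(default_factory=list)
    publications_about: list[str] = field(default_factory=list)
    publications_using: list[str] = field(default_factory=list)
    audit_notes: list[str] = field(default_factory=list)
    advisory_advice: list[str] = field(default_factory=list)
    certifications: list[Certification] = field(default_factory=list)

    def missing_artifacts(self) -> list[str]:
        """Names of the artifact categories that have no entries yet."""
        return [name for name, value in vars(self).items()
                if isinstance(value, list) and not value]


# Made-up example: only informal user feedback has been captured so far.
ctx = DatasetContext("example-sst", user_feedback=["Check the cloud mask"])
print(ctx.missing_artifacts())
```

A check like `missing_artifacts` is one way an archive could see which kinds of context it still lacks before an expert, or an expert system, relies on the record.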