Interagency Data Stewardship/LifeCycle/WorkshopReport

From Federation of Earth Science Information Partners
Revision as of 09:59, August 12, 2009 by Rduerr


Crisp definition of provenance and context (use OAIS & USGCRP definitions) (Ruth and Mark)

Provenance and context are two of the four components called out as Preservation Description Information in the OAIS Reference Model (OAIS-RM) - the other two are fixity and reference information. As defined by the OAIS-RM, provenance is

   "The information that documents the history of the Content Information.  This information tells the origin 
   or source of the Content Information, any changes that may have taken place since it was originated, and who 
   has had custody of it since it was originated.  Examples of Provenance Information are the principal investigator 
   who recorded the data, and the information concerning its storage, handling, and migration."

and context is

   "The information that documents the relationships of the Content Information to its environment.  This includes
   why the Content Information was created and how it relates to other Content Information objects."

However, these definitions are broad and can be applied to any object, not just Earth science data. As a result, they do not concretely specify the information needed. Fortunately, a report from a joint workshop sponsored by NASA and NOAA through the USGCRP does. That information includes:

  1. "Instrument/sensor characteristics including pre-flight or pre-operational performance measurements (e.g., spectral response, noise characteristics, etc.)
  2. Instrument/sensor calibration data and method
  3. Processing algorithms and their scientific basis, including complete description of any sampling or mapping algorithm used in creation of the product (e.g., contained in peer-reviewed papers, in some cases supplemented by thematic information introducing the data set or derived product)
  4. Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product
  5. Processing history including versions of processing source code corresponding to versions of the data set or derived product held in the archive
  6. Quality assessment information
  7. Validation record, including identification of validation data sets
  8. Data structure and format, with definition of all parameters and fields
  9. In the case of earth based data, station location and any changes in location, instrumentation, controlling agency, surrounding land use and other factors which could influence the long-term record
  10. A bibliography of pertinent Technical Notes and articles, including refereed publications reporting on research using the data set
  11. Information received back from users of the data set or product
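The eleven items above can be thought of as a structured preservation record. As a minimal sketch (all field names here are illustrative, not a standard schema), one could capture them along the following lines, including a rough completeness check of the kind discussed later under "provenance readiness levels":

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PreservationRecord:
    """Hypothetical record mirroring the eleven USGCRP preservation items.

    Field names are illustrative; they are not a standard schema.
    """
    instrument_characteristics: str = ""                              # item 1
    calibration_data_and_method: str = ""                             # item 2
    algorithm_descriptions: List[str] = field(default_factory=list)   # item 3
    ancillary_data_sets: List[str] = field(default_factory=list)      # item 4
    processing_history: List[str] = field(default_factory=list)       # item 5: code versions per data version
    quality_assessment: str = ""                                      # item 6
    validation_record: List[str] = field(default_factory=list)        # item 7
    data_structure_and_format: str = ""                               # item 8
    station_history: List[str] = field(default_factory=list)          # item 9: ground-based data only
    bibliography: List[str] = field(default_factory=list)             # item 10
    user_feedback: List[str] = field(default_factory=list)            # item 11

    def completeness(self) -> float:
        """Fraction of the eleven items with any content at all."""
        values = vars(self).values()
        filled = sum(1 for v in values if v)
        return filled / len(vars(self))
```

A record with only a quality assessment and one processing-history entry would score 2/11; refining "any content at all" into real readiness levels is exactly the open metric question.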

Vision and objectives (Ruth & Mark)

Two page-turning stories of success and failure (John, Tom, Dennis, Al Fleig)

Summary of the state of the art

  1. Digital Library work (Nancy, Bruce Wilson’s friends)
  2. Records management work (Tom Ross? Mark to find someone; Bruce Barkstrom to try also)
  3. Workflow community work (Frew? Brian? Paolo?)
  4. Earth science data community (Bruce Wilson, Ruth, Ken Casey?)

Lessons learned (use real examples and use cases wherever possible) (Bruce Barkstrom, Ruth)

  1. From past experiences (call to ESIP members)
    1. Impact of Levels of Service
    2. Need to frequently touch things to see that they are still OK
    3. Levels of abstraction in the storage infrastructure make it easier to reconstruct provenance (insulate provenance from changes in the environment)
    4. Direct pointers to things are always a problem in the long term (pointers to pointers are a good pattern)
    5. Monocultures are expensive and fragile
    6. Events need to be tracked
    7. The Level 0 data and its definition are sometimes neglected (ties back to identifiers for data, metadata, and processing methods)
    8. It is much harder and more expensive to retrofit provenance/context than to capture it as it happens – better to have something there than nothing, even if it has to be refined later (note this is ongoing)
    9. Capture of provenance is an asymptotic process needing judgment to determine what is good enough (0 is not good enough)
    10. Understanding what the software did is hard – need to keep the code and the parameters
    11. Risky to get rid of higher-level data when it may have taken hundreds of FTE-years of work to generate in the first place
    12. Legal community may drive this for climate data (e.g., Bruce and Mann Hockeystick)
  2. From other disciplines
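Lesson 4 above (pointers to pointers) can be sketched as a resolver: clients store a stable identifier rather than a direct location, and a resolution table maps the identifier to wherever the data currently lives. The class and identifiers below are hypothetical illustrations, not any particular resolver's API:

```python
class Resolver:
    """Maps stable identifiers to current locations (a pointer to a pointer)."""

    def __init__(self):
        self._table = {}

    def register(self, identifier: str, location: str) -> None:
        self._table[identifier] = location

    def move(self, identifier: str, new_location: str) -> None:
        # A migration touches one table entry, not every document
        # or data set that cites the identifier.
        self._table[identifier] = new_location

    def resolve(self, identifier: str) -> str:
        return self._table[identifier]
```

When a granule moves between archives, only the resolver entry changes; every stored reference to the identifier remains valid, which is why direct pointers are a long-term problem and indirection is the safer pattern.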

Critical issues (describing the issues) – Curt Tilmes

  1. Agency policies and constraints (Rama, NOAA?)
  2. Carrots for the research and funded data management community
  3. Legacy issues
    1. Chain of custody is important
    2. How to understand 100,000 lines of code
    3. Capturing tacit knowledge (Bruce Barkstrom has a good example)
    4. Managing the cal/val data is extremely important
    5. Getting all the provenance/context information into an archive somewhere
    6. Developing mechanisms to link all that info together in a preservable way
  4. Identifiers
  5. Archive Information Packages (AIPs)

Implementation Strategy – Mark Parsons

  1. Data center plans
    1. Gap analysis vs. provenance and context
    2. Determine relevant levels of service
    3. Test and assess against user needs
    4. Fill gaps as necessary, develop relevant procedures and policies, document unfillable gaps
    5. Producer community training (best practices cookbook)
  2. Develop more comprehensive preservation and stewardship strategy
    1. Develop guidelines for producers (NARA, LOC to help?)
    2. Develop completeness metric – provenance readiness levels
  3. Roles and responsibilities - if you touch the data you document it
  4. Define Levels of Service and recommend approaches to determining a LOS – following standards can help raise the LOS
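The gap-analysis step in the data center plans above amounts to diffing the metadata a center actually holds against a required-items checklist. A minimal sketch, with an assumed (and deliberately abbreviated) checklist drawn from the USGCRP items:

```python
# Assumed checklist; a real one would enumerate all required
# provenance/context items for the relevant level of service.
REQUIRED_ITEMS = {
    "instrument characteristics", "calibration data and method",
    "processing algorithms", "ancillary data sets", "processing history",
    "quality assessment", "validation record", "data structure and format",
}

def gap_analysis(held_items):
    """Return the required provenance/context items not yet archived."""
    return sorted(REQUIRED_ITEMS - set(held_items))
```

The resulting list feeds directly into the later steps: fill the gaps that can be filled, and document the ones that cannot.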

Research agenda (Bruce Wilson)

  1. Providing the tools to automate the process is necessary to ensure reproducibility
  2. Virtualization – how far can you go, what can be reproduced that way
  3. Identifier issue (Ruth) – keeping track of these over time
  4. How to create AIPs (standards, practices)

Appendix A: More war stories (Al Fleig)