Interagency Data Stewardship/LifeCycle/WorkshopReport

From Federation of Earth Science Information Partners
Revision as of 16:41, August 12, 2009 by Rduerr


The annotated outline from the ESIP meeting in Santa Barbara follows. Actual development of the report is being done in Google Docs. The title of each main section in the outline will take you to the Google Docs document for that section. If you can't get in, you either don't have a Google account or aren't a member of this ESIP cluster. Once those problems are fixed, if you'd like access, send an email to rduerr at nsidc dot org and I'll have Google send you an invitation to view and edit these documents.


Introduction

Crisp definition of provenance and context (use OAIS & USGCRP definitions) (Ruth and Mark)

Vision and objectives (Ruth & Mark)

Two page burning stories of success and failure (John, Tom, Dennis, Al Fleig)

Summary of the state of the art

  1. Digital Library work (Nancy, Bruce Wilson’s friends)
  2. Records management work (Tom Ross? Mark to find someone; Bruce Barkstrom to try also)
  3. Workflow community work (Frew? Brian? Paolo?)
  4. Earth science data community (Bruce Wilson, Ruth, Ken Casey?)

Lessons learned (use real examples and use cases wherever possible) (Bruce Barkstrom, Ruth)

  1. From past experiences (call to ESIP members)
    1. Impact of Levels of Service
    2. Need to frequently touch things to see that they are still OK
    3. Levels of abstraction in storage infrastructure make it easier to reconstruct provenance (insulate provenance from changes in environment)
    4. Direct pointers to things are always a problem in the long term (pointers to pointers are a good pattern)
    5. Monocultures are expensive and fragile
    6. Events need to be tracked
    7. The Level 0 data and its definition are sometimes neglected (ties back to identifiers for data, metadata, and processing methods)
    8. It is much harder and costs much more to try to retrofit provenance/context than to capture it as it happens – better to have something there than nothing, even if it has to be refined later (note this is ongoing)
    9. Capture of provenance is an asymptotic process needing judgment to determine what is good enough (0 is not good enough)
    10. Understanding what the software did is hard – need to keep the code and the parameters
    11. It is risky to get rid of higher-level data when it may have taken hundreds of FTE-years of work to generate in the first place
    12. The legal community may drive this for climate data (e.g., Bruce and Mann hockey stick)
  2. From other disciplines
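
The lesson about frequently touching things to see that they are still OK can be illustrated with a periodic fixity check. The following is only a minimal sketch, assuming SHA-256 checksums recorded in a simple in-memory manifest; the function names (`sha256_of`, `build_manifest`, `audit`) and the status labels are hypothetical and do not come from the workshop discussion.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, names: list[str]) -> dict[str, str]:
    """Record the checksum of each named file at ingest time (hypothetical helper)."""
    return {name: sha256_of(root / name) for name in names}


def audit(manifest: dict[str, str], root: Path) -> dict[str, str]:
    """Re-read every file and report 'ok', 'changed', or 'missing' for each."""
    report = {}
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file():
            report[name] = "missing"
        elif sha256_of(path) != expected:
            report[name] = "changed"
        else:
            report[name] = "ok"
    return report
```

Running `audit` on a schedule and logging each result as an event would also give a simple record of when a holding was last known to be intact.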

Critical issues (describing the issues) – Curt Tilmes

  1. Agency policies and constraints (Rama, NOAA?)
  2. Carrots for the research and funded data management community
  3. Legacy issues
    1. Chain of custody is important
    2. How to understand 100,000 lines of code
    3. Capturing tacit knowledge (Bruce Barkstrom has a good example)
    4. Managing the cal/val data is extremely important
    5. Getting all the provenance/context information into an archive somewhere
    6. Developing mechanisms to link all that info together in a preservable way
  4. Identifiers
  5. Archive Information Packages (AIPs)

Implementation Strategy – Mark Parsons

  1. Data center plans
    1. Gap analysis vs. provenance and context
    2. Determine relevant levels of service
    3. Test and assess against user needs
    4. Fill gaps as necessary, develop relevant procedures and policies, document unfillable gaps
    5. Producer community training (best practices cookbook)
  2. Develop more comprehensive preservation and stewardship strategy
    1. Develop guidelines for producers (NARA, LOC to help?)
    2. Develop completeness metric – provenance readiness levels
  3. Roles and responsibilities - if you touch the data you document it
  4. Define Levels of Service and recommend approaches to determining a LOS – following standards can help raise the LOS

Research agenda (Bruce Wilson)

  1. Providing the tools to automate the process is necessary to ensure reproducibility
  2. Virtualization – how far can you go, what can be reproduced that way
  3. Identifier issue (Ruth) – keeping track of these over time
  4. How to create AIPs (standards, practices)
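
As an illustration of the identifier issue, and of the earlier lesson that pointers to pointers hold up better than direct pointers, here is a minimal sketch of one level of indirection. It assumes a simple in-memory resolver; the `Resolver` class, its methods, and the example identifiers and locations are all hypothetical and are not a description of any existing identifier scheme.

```python
from dataclasses import dataclass, field


@dataclass
class Resolver:
    """Maps a persistent identifier to the current location of an object.

    References store the identifier, never the location, so a move only
    requires updating one entry here.
    """
    locations: dict[str, str] = field(default_factory=dict)
    # Each event is (identifier, old_location, new_location).
    history: list[tuple[str, str, str]] = field(default_factory=list)

    def register(self, identifier: str, location: str) -> None:
        """Assign an identifier to an object for the first time."""
        if identifier in self.locations:
            raise ValueError(f"identifier already registered: {identifier}")
        self.locations[identifier] = location

    def move(self, identifier: str, new_location: str) -> None:
        """Record that the object moved; the identifier itself never changes."""
        old_location = self.locations[identifier]  # KeyError if unknown
        self.locations[identifier] = new_location
        self.history.append((identifier, old_location, new_location))

    def resolve(self, identifier: str) -> str:
        """Return the current location for an identifier."""
        return self.locations[identifier]
```

A citation made against the hypothetical identifier `example:dataset-0001` would keep resolving after the data moves from one storage system to another, and `history` keeps the trail of moves so that the event itself is tracked.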

Appendix A: More war stories (Al Fleig)