Interagency Data Stewardship/LifeCycle/WorkshopReport
Introduction
- Crisp definition of provenance and context (use OAIS & USGCRP definitions) (Ruth and Mark)
- Vision and objectives (Ruth & Mark)
- Two page-turning stories of success and failure (John, Tom, Dennis, Al Fleig)
Summary of the state of the art
- Digital Library work (Nancy, Bruce Wilson’s friends)
- Records management work (Tom Ross? Mark to find someone; Bruce Barkstrom to try also)
- Workflow community work (Frew? Brian? Paolo?)
- Earth science data community (Bruce Wilson, Ruth, Ken Casey?)
Lessons learned (use real examples and use cases wherever possible) (Bruce Barkstrom, Ruth)
- From past experiences (call to ESIP members)
- Impact of Levels of Service
- Need to touch holdings frequently to verify that they are still OK
- Levels of abstraction in the storage infrastructure make it easier to reconstruct provenance (they insulate provenance from changes in the environment)
- Direct pointers to things are always a problem in the long term (pointers to pointers are a good pattern; see the resolver sketch after this list)
- Monocultures are expensive and fragile
- Events need to be tracked
- The Level 0 data and its definition are sometimes neglected (this ties back to identifiers for data, metadata, and processing methods)
- It is much harder and more expensive to retrofit provenance/context than to capture it as it happens – better to have something there than nothing, even if it has to be refined later (note that this is ongoing; a run-time capture sketch follows this list)
- Capture of provenance is an asymptotic process; judgment is needed to determine what is good enough (zero is not good enough)
- Understanding what the software did is hard – both the code and the run parameters need to be kept
- It is risky to get rid of higher-level data when it may have taken hundreds of FTE-years of work to generate in the first place
- The legal community may drive this for climate data (e.g., Bruce and the Mann hockey stick)
- From other disciplines
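One way to read the "pointers to pointers" lesson above is as a resolver layer: provenance records cite a persistent identifier, and a separate, maintainable table maps that identifier to the current storage location. The sketch below is a minimal illustration of that indirection; the identifier scheme and paths are made up for the example.

```python
# Minimal sketch of identifier indirection: records cite persistent
# identifiers, and only the resolver knows the current location.
class Resolver:
    """Maps persistent identifiers to current storage locations."""

    def __init__(self):
        self._locations = {}

    def register(self, pid, location):
        self._locations[pid] = location

    def resolve(self, pid):
        return self._locations[pid]


resolver = Resolver()
resolver.register("example:dataset/12345", "/archive/v1/dataset_12345.tar")

# When the archive is reorganized, only the resolver entry changes;
# every record that cites the identifier stays valid.
resolver.register("example:dataset/12345", "s3://archive-v2/dataset_12345.tar")

print(resolver.resolve("example:dataset/12345"))
```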
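The "capture it as it happens" lessons suggest writing a provenance event at run time, while the code version, parameters, and inputs are still known. A minimal sketch, assuming the processing code lives in a git repository; the field names and identifiers are illustrative, not a proposed standard.

```python
import json
import subprocess
from datetime import datetime, timezone


def capture_run_provenance(input_ids, parameters, log_path):
    """Write a provenance event while the code version and parameters are still at hand."""
    record = {
        "event": "processing_run",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Exact code version, so the processing step can be reconstructed later.
        "code_version": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
        "parameters": parameters,   # the run configuration actually used
        "inputs": input_ids,        # persistent identifiers of the input granules
    }
    # Append-only event log: events are tracked, never overwritten.
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")
    return record


capture_run_provenance(
    input_ids=["example:granule/001", "example:granule/002"],
    parameters={"calibration_table": "v2.1", "grid": "1deg"},
    log_path="provenance_events.jsonl",
)
```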
Critical issues (describing the issues) – Curt Tilmes
- Agency policies and constraints (Rama, NOAA?)
- Carrots (incentives) for the research and funded data management communities
- Legacy issues
- Chain of custody is important
- How to understand 100,000 lines of code
- Capturing tacit knowledge (Bruce Barkstrom has a good example)
- Managing the cal/val data is extremely important
- Getting all the provenance/context information into an archive somewhere
- Developing mechanisms to link all that info together in a preservable way
- Identifiers
- Archival Information Packages (AIPs) (see the manifest sketch after this list)
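For the AIP issue above, the OAIS model breaks a package into Content Information plus Preservation Description Information (reference, provenance, context, and fixity). The sketch below assembles a manifest along those lines; the file, identifiers, and field values are placeholder assumptions, not a recommended package format.

```python
import hashlib
import json
from pathlib import Path


def build_aip_manifest(data_file, pid, provenance, context):
    """Assemble an OAIS-style manifest: Content Information plus PDI."""
    payload = Path(data_file).read_bytes()
    return {
        "content_information": {"file": data_file, "size_bytes": len(payload)},
        "preservation_description_information": {
            "reference": pid,                # persistent identifier
            "provenance": provenance,        # how the data came to be
            "context": context,              # relation to other holdings
            "fixity": {"sha256": hashlib.sha256(payload).hexdigest()},
        },
    }


# Illustrative usage with a placeholder data file.
Path("granule_001.dat").write_bytes(b"example payload")
manifest = build_aip_manifest(
    "granule_001.dat",
    pid="example:granule/001",
    provenance={"source": "Level 0 telemetry", "processing_code": "v2.1"},
    context={"related_datasets": ["example:dataset/12345"]},
)
print(json.dumps(manifest, indent=2))
```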
Implementation Strategy – Mark Parsons
- Data center plans
- Gap analysis vs. provenance and context
- Determine relevant levels of service
- Test and assess against user needs
- Fill gaps as necessary, develop relevant procedures and policies, and document unfillable gaps
- Producer community training (best practices cookbook)
- Develop more comprehensive preservation and stewardship strategy
- Develop guidelines for producers (NARA, LOC to help?)
- Develop a completeness metric – provenance readiness levels (see the checklist sketch after this list)
- Roles and responsibilities – if you touch the data, you document it
- Define Levels of Service and recommend approaches to determining a LOS – following standards can help raise the LOS
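One possible shape for the completeness metric above is a checklist mapped onto provenance readiness levels. The criteria and thresholds below are purely illustrative assumptions, shown only to make the idea concrete; the actual levels would come out of the strategy work.

```python
# Hypothetical checklist; the criteria and thresholds are assumptions for illustration.
CRITERIA = [
    "source_code_archived",
    "run_parameters_recorded",
    "input_identifiers_recorded",
    "cal_val_data_archived",
    "processing_events_logged",
]


def provenance_readiness_level(holding):
    """Return a 0-3 readiness level based on how many criteria the holding meets."""
    met = sum(1 for criterion in CRITERIA if holding.get(criterion, False))
    fraction = met / len(CRITERIA)
    if fraction == 1.0:
        return 3   # fully documented
    if fraction >= 0.6:
        return 2   # mostly documented, known gaps
    if fraction > 0.0:
        return 1   # partial capture
    return 0       # nothing captured (zero is not good enough)


print(provenance_readiness_level({"source_code_archived": True,
                                  "run_parameters_recorded": True}))
```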
Research agenda (Bruce Wilson)
- Providing tools to automate provenance capture is necessary to ensure reproducibility
- Virtualization – how far can you go, what can be reproduced that way
- Identifier issue (Ruth) – keeping track of these over time
- How to create AIPs (standards, practices)