Interagency Data Stewardship/LifeCycle/WorkshopReport
Introduction
- Crisp definition of provenance and context (use OAIS & USGCRP definitions) (Ruth and Mark)
- Vision and objectives (Ruth & Mark)
- Two page-turning stories of success and failure (John, Tom, Dennis, Al Fleig)
Summary of the state of the art
- Digital Library work (Nancy, Bruce Wilson’s friends)
- Records management work (Tom Ross? Mark to find someone; Bruce Barkstrom to try also)
- Workflow community work (Frew? Brian? Paolo?)
- Earth science data community (Bruce Wilson, Ruth, Ken Casey?)
Lessons learned (use real examples and use cases wherever possible) (Bruce Barkstrom, Ruth)
- From past experiences (call to ESIP members)
- Impact of Levels of Service
- Need to touch holdings frequently to verify that they are still OK
- Levels of abstraction in the storage infrastructure make it easier to reconstruct provenance (they insulate provenance from changes in the environment)
- Direct pointers to things are always a problem in the long term (pointers to pointers are a good pattern; see the resolver sketch after this list)
- Monocultures are expensive and fragile
- Events need to be tracked
- The Level 0 data and its definition are sometimes neglected (this ties back to identifiers for data, metadata, and processing methods)
- It is much harder and more expensive to retrofit provenance/context than to capture it as it happens – better to have something there than nothing, even if it has to be refined later (note that this is ongoing; a run-time capture sketch follows this list)
- Capture of provenance is an asymptotic process; judgment is needed to determine what is good enough (zero is not good enough)
- Understanding what the software did is hard – both the code and the run parameters need to be kept
- It is risky to get rid of higher-level data when it may have taken hundreds of FTE-years of work to generate in the first place
- The legal community may drive this for climate data (e.g., Bruce and the Mann hockey stick)
- From other disciplines
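One way to read the "pointers to pointers" lesson above is as a resolver layer: provenance records cite a persistent identifier, and a separate, maintainable table maps that identifier to the current storage location. The sketch below is a minimal illustration of that indirection; the identifier scheme and paths are made up for the example.

```python
# Minimal sketch of identifier indirection: records cite persistent
# identifiers, and only the resolver knows the current location.
class Resolver:
    """Maps persistent identifiers to current storage locations."""

    def __init__(self):
        self._locations = {}

    def register(self, pid, location):
        self._locations[pid] = location

    def resolve(self, pid):
        return self._locations[pid]


resolver = Resolver()
resolver.register("example:dataset/12345", "/archive/v1/dataset_12345.tar")

# When the archive is reorganized, only the resolver entry changes;
# every record that cites the identifier stays valid.
resolver.register("example:dataset/12345", "s3://archive-v2/dataset_12345.tar")

print(resolver.resolve("example:dataset/12345"))
```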
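The "capture it as it happens" lessons suggest writing a provenance event at run time, while the code version, parameters, and inputs are still known. A minimal sketch, assuming the processing code lives in a git repository; the field names and identifiers are illustrative, not a proposed standard.

```python
import json
import subprocess
from datetime import datetime, timezone


def capture_run_provenance(input_ids, parameters, log_path):
    """Write a provenance event while the code version and parameters are still at hand."""
    record = {
        "event": "processing_run",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Exact code version, so the processing step can be reconstructed later.
        "code_version": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
        "parameters": parameters,   # the run configuration actually used
        "inputs": input_ids,        # persistent identifiers of the input granules
    }
    # Append-only event log: events are tracked, never overwritten.
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")
    return record


capture_run_provenance(
    input_ids=["example:granule/001", "example:granule/002"],
    parameters={"calibration_table": "v2.1", "grid": "1deg"},
    log_path="provenance_events.jsonl",
)
```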
Critical issues (describing the issues) – Curt Tilmes
- Agency policies and constraints (Rama, NOAA?)
- Carrots (incentives) for the research and funded data management communities
- Legacy issues
- Chain of custody is important
- How to understand 100,000 lines of code
- Capturing tacit knowledge (Bruce Barkstrom has a good example)
- Managing the cal/val data is extremely important
- Getting all the provenance/context information into an archive somewhere
- Developing mechanisms to link all that info together in a preservable way
- Identifiers
- Archival Information Packages (AIPs) (see the manifest sketch after this list)
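For the AIP issue above, the OAIS model breaks a package into Content Information plus Preservation Description Information (reference, provenance, context, and fixity). The sketch below assembles a manifest along those lines; the file, identifiers, and field values are placeholder assumptions, not a recommended package format.

```python
import hashlib
import json
from pathlib import Path


def build_aip_manifest(data_file, pid, provenance, context):
    """Assemble an OAIS-style manifest: Content Information plus PDI."""
    payload = Path(data_file).read_bytes()
    return {
        "content_information": {"file": data_file, "size_bytes": len(payload)},
        "preservation_description_information": {
            "reference": pid,                # persistent identifier
            "provenance": provenance,        # how the data came to be
            "context": context,              # relation to other holdings
            "fixity": {"sha256": hashlib.sha256(payload).hexdigest()},
        },
    }


# Illustrative usage with a placeholder data file.
Path("granule_001.dat").write_bytes(b"example payload")
manifest = build_aip_manifest(
    "granule_001.dat",
    pid="example:granule/001",
    provenance={"source": "Level 0 telemetry", "processing_code": "v2.1"},
    context={"related_datasets": ["example:dataset/12345"]},
)
print(json.dumps(manifest, indent=2))
```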
Implementation Strategy – Mark Parsons
- Data center plans
- Gap analysis vs. provenance and context
- Determine relevant levels of service
- Test and assess against user needs
- Fill gaps as necessary, develop relevant procedures and policies, and document unfillable gaps
- Producer community training (best practices cookbook)
- Develop more comprehensive preservation and stewardship strategy
- Develop guidelines for producers (NARA, LOC to help?)
- Develop a completeness metric – provenance readiness levels (see the checklist sketch after this list)
- Roles and responsibilities – if you touch the data, you document it
- Define Levels of Service and recommend approaches to determining a LOS – following standards can help raise the LOS
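One possible shape for the completeness metric above is a checklist mapped onto provenance readiness levels. The criteria and thresholds below are purely illustrative assumptions, shown only to make the idea concrete; the actual levels would come out of the strategy work.

```python
# Hypothetical checklist; the criteria and thresholds are assumptions for illustration.
CRITERIA = [
    "source_code_archived",
    "run_parameters_recorded",
    "input_identifiers_recorded",
    "cal_val_data_archived",
    "processing_events_logged",
]


def provenance_readiness_level(holding):
    """Return a 0-3 readiness level based on how many criteria the holding meets."""
    met = sum(1 for criterion in CRITERIA if holding.get(criterion, False))
    fraction = met / len(CRITERIA)
    if fraction == 1.0:
        return 3   # fully documented
    if fraction >= 0.6:
        return 2   # mostly documented, known gaps
    if fraction > 0.0:
        return 1   # partial capture
    return 0       # nothing captured (zero is not good enough)


print(provenance_readiness_level({"source_code_archived": True,
                                  "run_parameters_recorded": True}))
```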
Research agenda (Bruce Wilson)
- Providing tools to automate provenance capture is necessary to ensure reproducibility
- Virtualization – how far can you go, what can be reproduced that way
- Identifier issue (Ruth) – keeping track of these over time
- How to create AIPs (standards, practices)