Interagency Data Stewardship/LifeCycle/WorkshopReport

From Federation of Earth Science Information Partners
__NOTOC__
The annotated outline from the ESIP meeting in Santa Barbara follows.  Actual development of the report is being done in Google Docs.  The title of each main section in the outline links to the Google Docs document for that section.  If you don't have access, then either you don't have a Google account or you aren't a member of this ESIP cluster.  If you'd like access once those problems are fixed, send an email to rduerr at nsidc dot org and I'll have Google send you an invitation to view or edit these documents.

==[http://docs.google.com/Doc?docid=0AZXSHP7TCQaiZGh0aGt6N25fMGNjZHQ1aGRm&hl=en Introduction]==

===Crisp definition of provenance and context (use OAIS & USGCRP definitions) (Ruth and Mark)===

Provenance and context are two of the four components called out as Preservation Description Information in the OAIS Reference Model (OAIS-RM); the other two are fixity and reference information.  As defined by the OAIS-RM, provenance is

    "The information that documents the history of the Content Information.  This information tells the origin or source of the Content Information, any changes that may have taken place since it was originated, and who has had custody of it since it was originated.  Examples of Provenance Information are the principal investigator who recorded the data, and the information concerning its storage, handling, and migration."

and context is

    "The information that documents the relationships of the Content Information to its environment.  This includes why the Content Information was created and how it relates to other Content Information objects."
 
However, these definitions are broad and can be applied to any object, not just earth science data.  As a result, they do not concretely specify the information needed.  Fortunately, a report from a joint workshop sponsored by NASA and NOAA through the USGCRP program provides a definition of the information needed to ensure that earth science data is useful to the science community.  This information includes:
 
 
 
#"Instrument/sensor characteristics including pre-flight or pre-operational performance measurements (e.g., spectral response, noise characteristics, etc.)
 
#Instrument/sensor calibration data and method
 
#Processing algorithms and their scientific basis, including complete description of any sampling or mapping algorithm used in creation of the product (e.g., contained in peer-reviewed papers, in some cases supplemented by thematic information introducing the data set or derived product)
 
#Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product
 
#Processing history including versions of processing source code corresponding to versions of the data set or derived product held in the archive
 
#Quality assessment information
 
#Validation record, including identification of validation data sets
 
#Data structure and format, with definition of all parameters and fields
 
#In the case of earth-based data, station location and any changes in location, instrumentation, controlling agency, surrounding land use, and other factors that could influence the long-term record
 
#A bibliography of pertinent Technical Notes and articles, including refereed publications reporting on research using the data set
 
#Information received back from users of the data set or product"
 
 
 
With the exception of the data structure and format information, which could more properly be considered Representation Information in the OAIS-RM, this list defines the provenance and context information required for earth science data to be useful for climate studies.
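As a sketch of how the USGCRP list could be made concrete in practice, the record structure below pairs one field with each item (omitting data structure and format, which the OAIS-RM treats as Representation Information).  The field names are invented for illustration and are not taken from the report.

```python
# Illustrative sketch only: field names are assumptions, not from the USGCRP
# report. One field per USGCRP provenance/context item, with a crude check
# for which items remain undocumented.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProvenanceContextRecord:
    instrument_characteristics: str = ""          # item 1: pre-flight/pre-operational performance
    calibration_data_and_method: str = ""         # item 2
    processing_algorithms: List[str] = field(default_factory=list)  # item 3: papers, algorithm descriptions
    ancillary_data_sets: List[str] = field(default_factory=list)    # item 4
    processing_history: List[str] = field(default_factory=list)     # item 5: source-code versions per product version
    quality_assessment: str = ""                  # item 6
    validation_record: List[str] = field(default_factory=list)      # item 7: validation data sets
    station_history: str = ""                     # item 9: earth-based data only
    bibliography: List[str] = field(default_factory=list)           # item 10
    user_feedback: List[str] = field(default_factory=list)          # item 11

    def missing_items(self) -> List[str]:
        """Names of fields still empty, i.e. items not yet documented."""
        return [name for name, value in vars(self).items() if not value]
```

A record created with only a quality assessment filled in would report the other nine fields as missing, which is the kind of gap analysis the outline's implementation strategy calls for.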
 
  
 
===Vision and objectives (Ruth & Mark)===

===Two page burning stories of success and failure (John, Tom, Dennis, Al Fleig)===
==[http://docs.google.com/Doc?docid=0AZXSHP7TCQaiZGh0aGt6N25fMWY0Njc4NjRo&hl=en Summary of the state of the art]==
 
#Digital Library work (Nancy, Bruce Wilson’s friends)
#Records management work (Tom Ross? Mark to find someone; Bruce Barkstrom to try also)
#Workflow community work (Frew? Brian? Paolo?)
#Earth science data community (Bruce Wilson, Ruth, Ken Casey?)
==[http://docs.google.com/Doc?docid=0AZXSHP7TCQaiZGh0aGt6N25fMngzNTU4d2hw&hl=en Lessons learned] (use real examples and use cases wherever possible) (Bruce Barkstrom, Ruth)==
 
#From past experiences (call to ESIP members)
##Impact of Levels of Service
##Need to frequently touch things to see that they are still OK
##Levels of abstraction in storage infrastructure make it easier to reconstruct provenance (insulate provenance from changes in environment)
##Direct pointers to things are always a problem in the long term (pointers to pointers are a good pattern)
##Monocultures are expensive and fragile
##Events need to be tracked
##The Level 0 data and its definition is sometimes a neglected part (ties back to identifiers for data, metadata, and processing methods)
##It is much harder and costs much more to retrofit provenance/context than to capture it as it happens – better to have something there than nothing, even if it has to be refined later (note this is ongoing)
##Capture of provenance is an asymptotic process needing judgment to determine what is good enough (0 is not good enough)
##Understanding what the software did is hard – need to keep the code and the parameters
##Risky to get rid of the higher-level data when it may have taken 100s of FTE-years of work to generate it in the first place
##Legal community may drive this for climate data (e.g., Bruce and Mann Hockeystick)
#From other disciplines
==[http://docs.google.com/Doc?docid=0AZXSHP7TCQaiZGh0aGt6N25fM2Q4d2pzOGc0&hl=en Critical issues] (describing the issues) – Curt Tilmes==
 
#Agency policies and constraints (Rama, NOAA?)
#Carrots for the research and funded data management community
#Legacy issues
##Chain of custody is important
##How to understand 100,000 lines of code
##Capturing tacit knowledge (Bruce Barkstrom has a good example)
##Managing the cal/val data is extremely important
##Getting all the provenance/context information into an archive somewhere
##Developing mechanisms to link all that info together in a preservable way
#Identifiers
#Archive Information Packages (AIPs)
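One recurring pattern for the identifier issue, also noted among the lessons learned ("pointers to pointers are a good pattern"), is indirection: published references carry only a persistent identifier, and a single resolver table maps that identifier to the current storage location.  The class and identifier below are a minimal sketch, not modeled on any specific identifier system.

```python
# Minimal sketch of identifier indirection: references carry a persistent
# identifier; only the resolver's table changes when data moves, so the
# published reference never breaks. Names and URLs are hypothetical.

class Resolver:
    def __init__(self):
        self._locations = {}  # persistent identifier -> current location

    def register(self, pid: str, url: str) -> None:
        self._locations[pid] = url

    def relocate(self, pid: str, new_url: str) -> None:
        # Storage moved: update the single indirection point.
        if pid not in self._locations:
            raise KeyError(f"unknown identifier: {pid}")
        self._locations[pid] = new_url

    def resolve(self, pid: str) -> str:
        return self._locations[pid]


resolver = Resolver()
resolver.register("doi:10.0000/example.v1", "https://archive-a.example.org/d1")
# Years later the data set migrates; every published reference still resolves.
resolver.relocate("doi:10.0000/example.v1", "https://archive-b.example.org/d1")
```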
==[http://docs.google.com/Doc?docid=0AZXSHP7TCQaiZGh0aGt6N25fNGNweHpnOGQy&hl=en Implementation Strategy] – Mark Parsons==
 
#Data center plans
##gap analysis vs provenance and context
##determine relevant levels of service
##test and assess against user needs
##fill gaps as necessary, develop relevant procedures and policies, document unfillable gaps
##producer community training (best practices cookbook)
#Develop more comprehensive preservation and stewardship strategy
##Develop guidelines for producers (NARA, LOC to help?)
##Develop completeness metric – provenance readiness levels
#Roles and responsibilities – if you touch the data, you document it
#Define Levels of Service and recommend approaches to determining an LOS – following standards can help raise the LOS
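The implementation strategy calls for a completeness metric ("provenance readiness levels").  One minimal sketch scores a data set by how many of the USGCRP items are documented and bins the fraction into coarse levels; the item names are taken from the USGCRP list, but the level boundaries are invented for illustration.

```python
# Hedged sketch of a "provenance readiness level" completeness metric:
# count documented USGCRP items and bin the fraction into levels 0-4.
# The binning scheme is an assumption, not from any ESIP recommendation.

USGCRP_ITEMS = [
    "instrument characteristics", "calibration data and method",
    "processing algorithms", "ancillary data sets", "processing history",
    "quality assessment", "validation record", "data structure and format",
    "station history", "bibliography", "user feedback",
]


def readiness_level(documented: set) -> int:
    """Map the documented fraction of USGCRP items to a level from 0 to 4."""
    fraction = len(documented & set(USGCRP_ITEMS)) / len(USGCRP_ITEMS)
    return int(fraction * 4 + 0.5)  # round to the nearest level
```

A metric like this gives data centers a concrete target for the gap analysis above: each level gained corresponds to a named set of items that must be captured.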
==[http://docs.google.com/Doc?docid=0AZXSHP7TCQaiZGh0aGt6N25fNWY0OW00Z2hw&hl=en Research agenda] (Bruce Wilson)==
 
#Providing the tools to automate the process is necessary to ensure reproducibility
#Virtualization – how far can you go, what can be reproduced that way
#Identifier issue (Ruth) – keeping track of these over time
#How to create AIP’s (standards, practices)
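One well-understood ingredient of AIP creation is fixity information, which the OAIS-RM names alongside provenance, context, and reference information as Preservation Description Information.  The sketch below computes a SHA-256 checksum manifest for every file in a package directory; the directory layout and function name are hypothetical, and a real AIP would bundle this manifest with the provenance and context records discussed above.

```python
# Sketch of fixity capture for an Archive Information Package: a manifest
# mapping each file's relative path to its SHA-256 digest, so later audits
# can detect corruption or undocumented change.
import hashlib
from pathlib import Path


def fixity_manifest(package_dir: str) -> dict:
    """Map each file's path (relative to the package root) to its SHA-256 digest."""
    root = Path(package_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    return manifest
```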
==[http://docs.google.com/Doc?docid=0AZXSHP7TCQaiZGh0aGt6N25fNjhkY3FtOGdw&hl=en Appendix A: More war stories] (Al Fleig)==

Latest revision as of 16:41, August 12, 2009