Interagency Data Stewardship/LifeCycle/Jan2011Meeting


ESIP 2011 Winter Meeting

Tuesday, January 4, 2011

2:00-3:30 Citation guidelines and identifiers

Notes from Session
  • Encourage publishers to enforce citation requirements
    • Papers need a unique name and location (URL + URN)
    • Unique identifier- location independent: copies everywhere share the same ID; can be minted without internet access or a naming authority
    • Unique locator- location is invariant and can always be found in at least one place
    • Citable identifier- same as unique locator but also accepted by publishers; reduces clutter from granule-level citation
    • Scientifically unique identifier- makes it possible to verify that contents are unchanged after a format change or rearrangement, guarding against tampering
    • Different id schemes were assessed based on technical value, user value, archive value, and existing usage in data centers.
      • UUID most promising for Unique identifier
      • Most are fine for Unique locators
      • DOI most suitable for Citable identifier
      • No existing models are optimized for Scientifically unique identifier
      • Different schemes solve different problems; plan on supporting many identifiers over time as they go in and out of service. Best recommendation: a UUID and a DOI at minimum.
    • Also suggested: use the UUID for collection identification only and relegate details to metadata.
    • Follow-up plan- work to have UUIDs for granules/files and DOIs for data sets adopted as NASA standards (a rough sketch of this combination follows below)
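
To make the minimum recommendation above concrete (a UUID per granule/file plus a DOI for the data set), here is a rough sketch in Python. The collection DOI, file name, and digest scheme are illustrative assumptions only; a digest over the decoded parameter values is just one possible way to approach the "scientifically unique identifier" idea discussed above.

```python
import hashlib
import json
import uuid

# Illustrative only: the DOI and granule names below are made up.
COLLECTION_DOI = "doi:10.9999/example.collection.v1"  # hypothetical DOI for the whole data set


def content_digest(values):
    """Digest over the decoded parameter values, not the file bytes,
    so the result survives a pure format change or rearrangement
    (one possible approximation of a 'scientifically unique identifier')."""
    canonical = json.dumps(values, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def register_granule(file_name, values):
    """Attach a location-independent UUID to a single granule/file and
    record the collection-level DOI that publishers would cite."""
    return {
        "granule_uuid": str(uuid.uuid4()),   # unique identifier: no naming authority needed
        "collection_doi": COLLECTION_DOI,    # citable identifier for the data set as a whole
        "file_name": file_name,
        "content_sha256": content_digest(values),
    }


if __name__ == "__main__":
    granule = register_granule("FOO_20110104_orbit001.dat", [1.2, 3.4, 5.6])
    print(granule)
```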


4:00-5:30 Towards an Earth Science provenance/context content standard - Part I

Notes from Session
  • Review Earth science provenance/context requirements
    • Data streams are often broken down into archival units
    • Controlled vocabulary needed for distinguishing data types (defines format, granularity, etc.)
    • Data versioning is more complicated than software versioning- the same data from the same system but with different calibrations could have different version names
  • How to distinguish all individual granules?
    • Example: Using FOO satellite data, tag each granule w/ UUID then DOI for the whole collection of granules
    • Problem: corruption errors in archives result in deletion and replacement of the data; the experiment then cannot be replicated
    • Provenance information is retained for the original data even though the data itself is deleted
    • If corrupted data is remade, it gets a new UUID... how does the reproduced experiment get cited?
    • Very messy problem- who made it? Does anyone deserve credit for reformatting it?
    • Is it possible to make it reproducible or to make it cite-able?
  • DOI and UUID have limitations
    • So consider a "process on demand" Dataset and an ephemeral "data transformation" web service
    • Can you look at data citations and determine if two researchers are using same data granules?
  • Begin to develop a plan for creating the standard
    • Should the federation develop citation guidelines and best practices for the use of identifiers?
    • Other organizations are already doing it, does ESIP need to as well?
    • ESIP should explicitly clarify roles and functions of identifiers for the organizations creating standards.
    • Establish principles on which the identity of data are assigned
  • Proposals for citation can be measured against these criteria
  • Be ready now to tell the scientific community how to cite data
  • Enabling citation could lead to a reproducibility standard

CONCLUSIONS

  • Identify roles and functions of identifiers
  • Recommend which identifiers are appropriate
  • Guidelines for how to cite ESIP data
  • Develop guidelines with the recognition that it is an ongoing process that will continuously improve
  • Who to work with for this?
    • DataCite, among others

Wednesday, January 5, 2011

  • 1:45-3:15 Towards an Earth Science provenance/context content standard - Part II
    • Complete plan for standards development
  • 3:45-5:15 Towards an Earth Science provenance/context ontology - Part I

Notes from Session

Two parts:

  1. Towards a provenance and context content standard (H. K. “Rama” Ramapriyan and John Moses)
    1. Will data be able to be understood 20 years from now?
    2. We must create a standard for this content: we need to understand and document what content is essential for long-term preservation.
    3. A standard helps us determine what data is useful for long-term study. We can assess compliance with this standard to determine the long-term utility of such data.
    4. We need two things for such a model: Provenance and Context.
      1. Provenance: The history of the content information: “Length”
      2. Context: The information that documents the relationship of the information to its environment: “Breadth”.
    5. Long-Term Archiving
      1. Must be operated in the simplest way possible, to meet user needs.
      2. Not only for present users, but future users as well.
      3. Scientists must be actively engaged in deciding which products to include and which to exclude from such an archive.
    6. Study recommends rigorous documentation of all aspects of data: numerical (e.g. data sets used) and more qualitative (e.g. descriptions of methods).
    7. Method and science based criteria would be useful for judging whether a collection is complete and will 'meet the needs of future scientists'.
    8. Instrument performance content criteria must be met. Criteria include provenance information and anticipated use. Complete information includes (among other things; these were mostly the items in bold):
      1. Instrument/sensor calibration method
      2. Instrument/sensor calibration data
        1. Bruce: Are there gaps in the information? Better planning required? Write down the artifacts, develop a plan.
      3. Processing algorithms and their scientific basis
      4. Sampling or mapping algorithms
      5. Data structure and format, w/ definitions of parameters and fields
      6. Processing history (incl. Station location and changes)
      7. Data quality assessment
      8. Validation record, identification of validation datasets
      9. Bibliography of research using the data
      10. All data sets used in generation or calibration of the derived product.
      11. Source code used in generation or calibration of the derived product.
        1. Nancy Hoebelheinrich: Thinking about this in terms of canonical usage or canonical users.
          1. Bruce Barkstrom: Attempt to formalize Nancy's point: Concept of designated user committee: Categories of designated users? Maybe we should decide what are the requirements of the knowledge base a particular user would need to have, and how much work they'd have to do to understand the information?
        2. Ted: A model is important, but there must be an understanding of the heterogeneity of this content. Metadata would tie a lot of this together.
      12. Considerations:
        1. Focus on digital earth science data useful for long-term trending and climate studies. Focus on a content standard: “what” not “how”. Breadth and depth (should users be able to easily understand the data, or should they have enough information to regenerate it at a later date?)
        2. Information sources. ← This is where the most work will be done over the next six months.
        3. Review groups (for validating/improving draft content standard)
          1. Tammy R. Walker: Which content should be made available to the public, and what content is unnecessary?
        4. Diversity. Different organizations have different requirements and schedules. The same content standard may not fit all; a general set of content items could be specified as the union of the contents for different disciplines.
        5. Sanity Check. Standard should enable each of the parties (data creators, intermediaries and users) to perform their functions.
      13. Proposed Actions
        1. Generate integrated and prioritized list of contents
        2. Decide how much and how to elaborate information on content
        3. Decide how to proceed (each program/agency on its own? ESIP-wide? IEEE? ISO?)
        4. Develop timeline.
      14. ISO and/or IEEE development processes.
        1. John Moses: What about the CCSDS as a vehicle?
          1. Rama: Then again, we are looking more broadly than just satellite data.
        2. Rama: It seems like technical specification is the way to go...
        3. Bruce Barkstrom: We have a somewhat incoherent collection of things that we want to do. Spend six months developing a list or spreadsheet of documents or artifacts, with designated user communities as the columns in the spreadsheet. It would clarify what kind of “nomenclature thickets” we're about to get into, and give us the ability to tie these particular elements together and categorize them. Going to a standard now is a big step; we're going to have plenty of work to do on the sources themselves, and working on them might give us a clue about representation. (A minimal sketch of such an artifact-by-community matrix appears after these notes.)
        4. Ted Habermann: I would be willing to take this list (presented by John and Rama) and see where it would fit within the current framework of standards.
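
A minimal sketch of the artifact-by-community matrix Bruce describes above, assuming a few made-up designated user communities as columns and some of the content items from the criteria list as rows; the entries are placeholders, not decisions from the session.

```python
# Sketch of the proposed artifact-by-user-community matrix.
# The community names and cell values below are hypothetical placeholders.
ARTIFACTS = [
    "Instrument/sensor calibration method",
    "Instrument/sensor calibration data",
    "Processing algorithms and their scientific basis",
    "Data structure and format",
    "Validation record",
    "Source code used in product generation",
]

COMMUNITIES = ["climate trend analyst", "algorithm developer", "archive operator"]

# rows = artifacts, columns = designated user communities; None = not yet assessed
needs_matrix = {
    artifact: {community: None for community in COMMUNITIES}
    for artifact in ARTIFACTS
}

# Example of filling one cell during review (placeholder decision):
needs_matrix["Validation record"]["climate trend analyst"] = True


def unassessed_cells(matrix):
    """List artifact/community pairs that still need a decision."""
    return [
        (artifact, community)
        for artifact, row in matrix.items()
        for community, needed in row.items()
        if needed is None
    ]


if __name__ == "__main__":
    print(f"{len(unassessed_cells(needs_matrix))} cells still to assess")
```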


2. How are we going to represent/share such a standard?


  1. Towards an Earth Science Provenance/Context Ontology
    1. This discussion will focus on the representation of that provenance and context information.
    2. We are looking for interoperability, at least among ourselves, but hopefully with the rest of the world too.
    3. To speak with the rest of the world: What language are they using?
    4. Linked Data is about using the Web to connect data that wasn't previously linked.
  2. We must have ontologies, to organize entities and concepts into common hierarchies and precisely describe relationships between those entities and concepts.
    1. Identifiers are key. Semantic web uses URIs as identifiers. When two entities reference something that is the “same” (semantically equivalent), they use the same identifier.
  3. Open Provenance Model
    1. IPAW '06: Session on provenance standardization and interoperability.
    2. Workshop in Salt Lake City (2007) is when Open Provenance Model (OPM) v1.0 was released. Requirements:
      1. Allows provenance exchange
      2. Tools for building and sharing a common model
      3. Defines provenance in a precise, technology-agnostic manner
      4. Supports representation of provenance for any “thing”
      5. Allows multiple levels of description to co-exist
      6. Defines a core set of inference rules
        1. OPM is more about process provenance.
    3. Specifications have not been fully produced for all the layers in the architecture of OPM.
    4. Abstract model of OPM:
      1. Artifact
      2. Process
      3. Agent
    5. Several specifications: OPM, OPMX (XML Schema; defines XSD types for the abstract model entities), OPMV (Vocabulary; aims to reuse existing Semantic Web technologies and vocabularies as much as possible; time is expressed as a property of artifact and process), OPMO (OWL Ontology; extends OPMV), and OPM4J. A small sketch of the abstract model in RDF follows these notes.
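
To make the OPM abstract model concrete, the sketch below records a single "process used one artifact and generated another, controlled by an agent" statement as RDF using the rdflib library. The OPMV namespace and property names are written as best recalled here, and the resource URIs are made up; treat the whole thing as an illustrative assumption rather than a normative encoding.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

# Assumed OPMV namespace and terms; check against the published vocabulary before relying on them.
OPMV = Namespace("http://purl.org/net/opmv/ns#")
EX = Namespace("http://example.org/provenance/")  # made-up resource URIs for illustration

g = Graph()
g.bind("opmv", OPMV)
g.bind("ex", EX)

raw = EX["granule/raw-001"]          # input artifact
product = EX["granule/l2-001"]       # derived artifact
calibrate = EX["process/calibrate"]  # process
team = EX["agent/instrument-team"]   # agent

g.add((raw, RDF.type, OPMV.Artifact))
g.add((product, RDF.type, OPMV.Artifact))
g.add((calibrate, RDF.type, OPMV.Process))
g.add((team, RDF.type, OPMV.Agent))

# Core OPM causal relationships: the process used the raw granule,
# the derived product was generated by that process, and the process
# was controlled by the instrument team.
g.add((calibrate, OPMV.used, raw))
g.add((product, OPMV.wasGeneratedBy, calibrate))
g.add((calibrate, OPMV.wasControlledBy, team))

print(g.serialize(format="turtle"))
```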

Thursday, January 6, 2011

  • 10:30-12:00 Towards an Earth Science provenance/context ontology - Part II
    • Refine use cases
    • Complete plan to develop ES Provenance/Context Ontology

Notes on Session

  1. Some other thoughts
    1. We are on provenance right now, but hopefully we will eventually be able to tackle justification and trust.
    2. We want to encompass formal description of all of the elements of provenance and context.
    3. We want more precise terms, not just “artifact” and “file”, but other terms to distinguish data granules, data levels, calibration, ancillary data, validation data, etc.
  2. Plan?
    1. Develop use cases for provenance/context applications.
    2. Grow in parallel with the provenance/context content standard
    3. Analyze existing work
    4. Work with other experts at ESIP (people in the next room!)
    5. Take advantage of ESIP Test Bed to try things out
      1. Bruce Barkstrom: We have three different basic communities. The provenance terminology that we have created does not necessarily apply to all of these communities.
        1. Curt Tilmes: You're describing one use case, I'm describing another. This is why we need to develop these use cases for the growth of Provenance/Context applications and terminology.
        2. Bruce: These models don't serve the people responsible for the planning, and the people interested in that planning. Think of Boeing and their blueprint structures. These people should be interested in this model, and should be served by it.
          1. John Moses: One of the things that would help here is to describe your target audience.
              1. Bruce: What is it that we need in order to preserve information? I worry about the algorithms; some of them are very inadequately described. How much effort do you have to put in to make this kind of use of the data?
      2. Curt: This plan needs to include shorter, achievable goals, alongside longer-term projects.


  • 1:30-3:00 Cluster business meeting
    • Chair/co-chair election - 15 min
    • Summarize results and plans from sessions ~ 30 min
    • Moving testbed activities forward ~ 30 min