Difference between revisions of "Interagency Data Stewardship/Identifiers"

From Earth Science Information Partners (ESIP)

Revision as of 22:49, April 11, 2010

Identifiers Testbed Activities

These wiki pages will be used to chronicle the identifiers testbed activities of the Data Stewardship cluster.

Proposal Text

This is the text related to the identifiers testbed from the proposal to the federation.

The Preservation and Stewardship Cluster and the NASA Technology Infusion Working Group have been considering permanent naming schemes for data products. These identifiers can serve as references in journal articles and must include versioning representations. Many naming options have been promoted, but the best choices for Earth science data require careful examination. Two datasets may differ only in format, byte order, data type, access method, etc., creating facets (dimensions) not relevant to classification schemes for books (Library of Congress, Dewey Decimal).

Ultimate Benefit: Permanent, unique names for Federation data products.

Cost: $5K for Programmer 2 to set up a test archive where data can be retrieved via the candidate naming schemes, as identified by the Preservation and Stewardship Cluster.

Goals and Objectives (feel free to update/modify/redirect if need be)

Unique and lasting data identifiers are needed for a wide range of purposes and at various scales from data sets, to collections of data sets, to individual files or data objects. Likewise, a large variety of identification schemes have been developed, each of which satisfies a subset of the overall needs and is most appropriately applied at various scales.

The ESIP Preservation and Stewardship Cluster has recognized that, as a consequence, data centers will need to support multiple identification schemes and different identifiers at different scales. The purpose of this testbed activity is to test and demonstrate the applicability of selected schemes with a wide variety of Earth science data types, with the ultimate goal of recommending a suite for use by ESIP Federation members...

At the 2009 AGU Fall Meeting, a town hall meeting was held on Peer-Reviewed Data Publication and Other Strategies to Sustain Verifiable Science. Among other things, the meeting addressed data citations.

Citations can vary with the bibliographic style, but they typically include some Data Type level metadata such as author and title. In addition to the identifying metadata, there should be a unique, actionable, persistent identifier for the referenced Data Set. Mechanisms to identify the exact suite of granules used are also needed, but do not yet exist. In any case, the data set identifier must have the following characteristics:


unique
There is only one canonical identifier for the data set.
actionable
The identifier can be used to resolve and ultimately locate the referenced data set.
persistent
The identifier included as a reference in the published science remains valid as long as the science it is based on remains valid. It must survive if a data set is moved from one archive to another, if the entire archive is taken over by a new organization, or if the data themselves are deleted (in that case, the data set metadata should still be retrievable).
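The three characteristics above can be illustrated with a small sketch. This is not part of the testbed and does not describe any particular scheme; the `Resolver` class, the `example:` identifier prefix, and the archive URLs are all hypothetical. It only shows what "unique", "actionable", and "persistent" might mean operationally for a registry that maps identifiers to locations.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Record:
    """What the (hypothetical) resolver knows about one identifier."""
    metadata: Dict[str, str]   # data set level metadata (title, author, ...)
    location: Optional[str]    # current URL, or None once the data are deleted


class Resolver:
    """Toy registry; a sketch of the three characteristics, not a real service."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}
        self._canonical: Dict[str, str] = {}  # data set name -> its one identifier

    def register(self, dataset_name: str, identifier: str,
                 location: str, metadata: Dict[str, str]) -> None:
        # unique: one canonical identifier per data set, and no identifier reuse
        if dataset_name in self._canonical:
            raise ValueError(f"{dataset_name} already has a canonical identifier")
        if identifier in self._records:
            raise ValueError(f"identifier already in use: {identifier}")
        self._canonical[dataset_name] = identifier
        self._records[identifier] = Record(dict(metadata), location)

    def resolve(self, identifier: str) -> Record:
        # actionable: the identifier leads to the data set's current location
        return self._records[identifier]

    def move(self, identifier: str, new_location: str) -> None:
        # persistent: the identifier is unchanged when the data change archives
        self._records[identifier].location = new_location

    def delete_data(self, identifier: str) -> None:
        # persistent: the metadata stay retrievable after the data are deleted
        self._records[identifier].location = None


resolver = Resolver()
resolver.register("glacier-photos", "example:glacier-photos",
                  "https://archive-a.example/glacier-photos",
                  {"title": "Glacier photo collection"})
resolver.move("example:glacier-photos", "https://archive-b.example/glacier-photos")
moved = resolver.resolve("example:glacier-photos").location
resolver.delete_data("example:glacier-photos")
tombstone = resolver.resolve("example:glacier-photos")
```

In this sketch the citation never has to change: the published identifier resolves to the new archive after the move, and to a metadata-only record after the data are deleted.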

Data Sets to be used

  • NSIDC's glacier photo collection (random place, time, source, with repeats occasionally)
  • NSIDC's GLAS data ("Picket fence" spatial organization over short time periods spaced intermittently)
  • GSFC's Ozone datasets (orbit data partitioned to keep all daylight data together; and profile data sets)
  • GSFC's 3-D merged cloud data set
  • NASA/Bruce B. - ERBE (long-time series from multiple satellites)
  • NOAA/Bruce B. - Hurricane Ike collection (spatially organized data collection)
  • ORNL - Luyssaert data set (an example of tracking pieces from multiple field data sets that go into a synthesized product)

Identifier schemes

The Identifier Characteristics Table lists the significant characteristics of each of the identifier schemes examined.

In this testbed, we will be using all of the schemes examined as part of the identifiers paper; however, we will tackle them sequentially in priority order as follows:

  1. DOI - Ruth to develop use cases (done)
  2. PURL - Curt to develop use cases
  3. UUID - Ruth to develop use cases
  4. OID - Bruce to develop use cases
  5. ARK
  6. XRI
  7. LSID
  8. Handles
    • NOTE: Aside from the first three schemes, the order given here is a draft. Please update it if you would prefer to see your favorite scheme ranked higher on the list.
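To make the first three schemes on the list more concrete, the following sketch shows what identifiers for one data set might look like under each. It is illustrative only: the landing-page URL, the `purl.example` host, and the DOI prefix `10.5555` with its suffix are placeholders, not identifiers assigned by any testbed participant. The UUID calls are from Python's standard `uuid` module, and the DOI pattern is a loose shape check rather than the full DOI syntax.

```python
import re
import uuid

# Hypothetical landing page for one data set; not a real archive URL.
LANDING_PAGE = "https://archive.example/datasets/glacier-photos/v1"

# DOI: "10.<registrant prefix>/<suffix>". Prefix and suffix here are placeholders.
doi = "10.5555/glacier-photos.v1"
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")  # loose shape check only


def doi_to_url(doi_name: str) -> str:
    """Turn a DOI name into a URL for the public DOI proxy resolver."""
    return "https://doi.org/" + doi_name


# PURL: an ordinary URL on a redirecting host; the host below is hypothetical.
purl = "https://purl.example/esip/glacier-photos/v1"

# UUID, name-based (version 5): the same name always yields the same UUID,
# so any archive can recompute it without consulting a central registry.
dataset_uuid = uuid.uuid5(uuid.NAMESPACE_URL, LANDING_PAGE)

# UUID, random (version 4): needs no name and no naming authority, but it
# carries no information and cannot be recomputed if it is lost.
random_uuid = uuid.uuid4()

doi_url = doi_to_url(doi)
```

One difference this brings out: the DOI and the PURL are actionable through a resolver, while a UUID on its own is unique but needs a separate lookup service before it can locate anything.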

Definitions

Granularity
The extent to which a Data Set can be broken down into individual granules; equivalently, the size of individual granules in a Data Set. The granularity is typically chosen for each Data Set by the science team that generates it, ideally such that it is neither so small that the number of granules is too large to work with, nor so big that a single granule becomes awkward to work with. Striking such a balance is occasionally not feasible (see, for example, the MODIS 5-minute products).
Granule
A distinct referenceable amount of data within a Data Set, based on its granularity.
Data Type
Data that have been processed in the same way: by the same general algorithm, with the same granularity, with the same format, etc. [This concept has been formalized by the NASA EOS project as an "Earth Science Data Type" or ESDT.]
Data Version
A concept incorporating the version of the algorithm producing specific granules, but also the versions of the inputs to that algorithm. [This is roughly represented by the NASA EOS term "Collection".] When data are reprocessed, they get a new Data Version, and a new Data Set is created.
Data Set
A collection of granules of the same Data Type and the same Data Version.
Closed Data Set
A Data Set that is no longer being updated.
Open Data Set
A Data Set that is currently being processed. Over time, it can be altered: current granules can be removed from it, or new granules can be added to it.
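The definitions above can be read as a small data model. The sketch below is one possible reading, not an agreed design: the class and method names, the granule identifiers, and the `EXAMPLE_TYPE` Data Type are hypothetical. It also includes the simple arithmetic behind the granularity trade-off, since a 5-minute granule size gives 1440 / 5 = 288 granules per day for a continuously produced product.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class DataSet:
    """Granules of one Data Type and one Data Version (hypothetical model)."""
    data_type: str
    data_version: str
    is_open: bool = True                      # Open until explicitly closed
    granules: Set[str] = field(default_factory=set)

    def add_granule(self, granule_id: str) -> None:
        # Only an Open Data Set can still gain granules.
        if not self.is_open:
            raise RuntimeError("Closed Data Set: no further updates")
        self.granules.add(granule_id)

    def remove_granule(self, granule_id: str) -> None:
        # Only an Open Data Set can still lose granules.
        if not self.is_open:
            raise RuntimeError("Closed Data Set: no further updates")
        self.granules.discard(granule_id)

    def close(self) -> None:
        self.is_open = False

    def reprocess(self, new_version: str) -> "DataSet":
        # Reprocessing gives a new Data Version and therefore a new Data Set;
        # the existing Data Set is left untouched.
        return DataSet(self.data_type, new_version)


def granules_per_day(granule_minutes: int) -> int:
    """Number of granules per day for a product produced continuously."""
    return (24 * 60) // granule_minutes


v1 = DataSet("EXAMPLE_TYPE", "1")
v1.add_granule("granule-0001")
v1.close()
v2 = v1.reprocess("2")
```

Under this reading, an identifier for `v1` keeps referring to exactly the granules it held when it was closed, while `v2` would need its own identifier. An Open Data Set is the harder case, because its contents can change after it has been cited.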

Use Cases

Please document your use case for a particular identifier scheme.

Development

Testbed Telecon Minutes

Other Resources