Interagency Data Stewardship/Identifiers
Identifiers Testbed Activities
These wiki pages chronicle the identifiers testbed activities of the Data Stewardship cluster.
Proposal Text
This is the text related to the identifiers testbed from the proposal to the federation.
The Preservation and Stewardship Cluster and the NASA Technology Infusion Working Group have been considering permanent naming schemes for data products. These identifiers can serve as references in journal articles and must include versioning representations. Many naming options have been promoted, but the best choices for Earth science data require careful examination. Two datasets may differ only in format, byte order, data type, access method, etc., creating facets (dimensions) not relevant to classification schemes for books (Library of Congress, Dewey Decimal).
Ultimate Benefit: Permanent, unique names for Federation data products.
Cost: $5K for Programmer 2 to set up a test archive where data can be retrieved via the candidate naming schemes, as identified by the Preservation and Stewardship Cluster.
Goals and Objectives
Unique and lasting data identifiers are needed for a wide range of purposes and at various scales, from data sets, to collections of data sets, to individual files or data objects. Likewise, a large variety of identification schemes has been developed, each of which satisfies a subset of the overall needs and is best applied at particular scales.
The ESIP Preservation and Stewardship cluster has recognized that, as a consequence, data centers will need to support multiple identification schemes and different identifiers at different scales. The purpose of this testbed activity is to test and demonstrate the applicability of selected schemes to a wide variety of Earth science data types, with the ultimate goal of recommending a suite for use by ESIP Federation members...
At the 2009 AGU Fall Meeting, a town hall meeting was held on "Peer-Reviewed Data Publication and Other Strategies to Sustain Verifiable Science". Among other things, the meeting addressed data citations.
Citations can vary based on the bibliographic style, but they typically include some Data Type level metadata such as author, title, etc. In addition to the identifying metadata, there should be a unique, actionable, persistent identifier for the referenced Data Set. Mechanisms to specifically identify the exact suite of granules used are also needed, but do not yet exist. In any case, the data set identifier must have the following characteristics:
- unique
- There is only one canonical identifier for the data set.
- actionable
- The identifier can be used to resolve and ultimately locate the referenced data set.
- persistent
- The identifier included as a reference in the published science remains valid as long as the science it is based on remains valid. It must survive if a data set is moved from one archive to another, if the entire archive is taken over by a new organization, or if the data themselves are deleted (in that case, it should still be possible to retrieve the data set metadata).
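The three characteristics above can be illustrated with a small sketch. This is not how any of the candidate schemes is implemented; it is a hypothetical in-memory resolver (the class name `Resolver`, its methods, and the example identifier and URLs are all assumptions made for illustration) that only shows what "unique", "actionable", and "persistent" demand of a resolution service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    metadata: dict            # data set level metadata, kept indefinitely
    location: Optional[str]   # current URL, or None once the data are deleted


class Resolver:
    """Hypothetical in-memory resolver; real schemes rely on external services."""

    def __init__(self):
        self._records = {}

    def register(self, identifier, metadata, location):
        # unique: one canonical identifier per data set, never reassigned
        if identifier in self._records:
            raise ValueError(f"{identifier} is already assigned")
        self._records[identifier] = Record(metadata, location)

    def move(self, identifier, new_location):
        # persistent: the data move, the identifier stays the same
        self._records[identifier].location = new_location

    def delete_data(self, identifier):
        # persistent: the metadata survive even when the data are deleted
        self._records[identifier].location = None

    def resolve(self, identifier):
        # actionable: the identifier leads to the data, or at least the metadata
        record = self._records[identifier]
        return record.location, record.metadata


# Usage with made-up values
resolver = Resolver()
resolver.register("example:glacier-photos",
                  {"title": "Glacier photo collection"},
                  "https://archive-a.example/glacier-photos")
resolver.move("example:glacier-photos",
              "https://archive-b.example/glacier-photos")
print(resolver.resolve("example:glacier-photos"))
```

The point of the design is that a citation stores only the identifier. Everything that can change (location, holding archive, even the existence of the data) sits behind the resolver.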
Data Sets to be used
- NSIDC's glacier photo collection (random place, time, source, with repeats occasionally)
- NSIDC's GLAS data ("Picket fence" spatial organization over short time periods spaced intermittently)
- GSFC's Ozone datasets (orbit data partitioned to keep all daylight data together; and profile data sets)
- GSFC's 3-D merged cloud data set
- NASA/Bruce B. - ERBE (long time series from multiple satellites)
- NOAA/Bruce B. - Hurricane Ike collection (spatially organized data collection)
- ORNL - Luyssaert data set (example of ways to track pieces from multiple different field datasets going into a synthesized product)
Identifier schemes
The Identifier Characteristics Table lists the significant characteristics of each of the identifier schemes examined.
In this testbed, we will be using all of the schemes examined as part of the identifiers paper; however, we will tackle them sequentially in priority order as follows:
- DOI - Ruth to develop use cases (done)
- PURL - Curt to develop use cases
- UUID - Ruth to develop use cases
- OID - Bruce to develop use cases
- ARK
- XRI
- LSID
- Handles
- NOTE: Aside from the first 3 schemes, the order given here is a draft - please update if you'd prefer to see your favorite scheme ranked higher on the list.
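To give a feel for how differently the schemes listed above look in practice, the following sketch guesses the scheme of an identifier string from its surface syntax. The patterns are deliberately rough approximations, not validators taken from the respective specifications, and every identifier in the usage lines is made up. The helper names `guess_scheme` and `doi_to_url` are assumptions for illustration. The only external fact relied on is that a DOI can be turned into a resolvable URL by prefixing the public DOI proxy, `https://doi.org/`.

```python
import re
import uuid

# Rough syntactic patterns, checked in order; approximations, not validators.
# Order matters: DOIs are Handles, and URLs would also match the Handle pattern.
PATTERNS = [
    ("UUID", re.compile(
        r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I)),
    ("LSID", re.compile(r"^urn:lsid:[^:]+:[^:]+:[^:]+(:[^:]+)?$", re.I)),
    ("ARK", re.compile(r"^ark:/?\d{5,}/\S+$", re.I)),
    ("XRI", re.compile(r"^xri://\S+$", re.I)),
    ("DOI", re.compile(r"^10\.\d{4,9}/\S+$")),
    ("OID", re.compile(r"^[0-2](\.\d+)+$")),
    ("URL (possibly a PURL)", re.compile(r"^https?://\S+$", re.I)),
    ("Handle", re.compile(r"^[^/\s]+/\S+$")),
]


def guess_scheme(identifier: str) -> str:
    """Return the name of the first pattern that matches, else 'unknown'."""
    for name, pattern in PATTERNS:
        if pattern.match(identifier):
            return name
    return "unknown"


def doi_to_url(doi: str) -> str:
    """Make a DOI actionable by prefixing the public DOI proxy."""
    return "https://doi.org/" + doi


# Usage with made-up identifiers
print(guess_scheme(str(uuid.uuid4())))            # a freshly minted UUID
print(guess_scheme("10.1234/example-dataset"))    # DOI
print(guess_scheme("ark:/12345/x7abc"))           # ARK
print(guess_scheme("1.3.6.1.4.1"))                # OID
print(doi_to_url("10.1234/example-dataset"))
```

Note that a PURL cannot be told apart from an ordinary HTTP URL by syntax alone, and that a UUID can be minted locally with no registry at all, which is one reason the schemes suit different scales.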
Planned Outcomes
- A paper discussing the utility of chosen identification schemes, and recommending one or more identification schemes to use for geospatial resources. See [1]
- A paper discussing operational considerations associated with assigning identifiers of each of nine identification schemes to at least one dataset and its components, and the impact of those considerations on the recommendations from the first paper.
- Further discussion about best practices related to the assignment / use of unique identifiers based on the experience gained from above activities.
Definitions
- Granularity
- The extent to which a Data Set can be broken down into individual granules, i.e., the size of individual granules in a Data Set. The granularity is typically chosen for each Data Set by the science team that generates it, ideally such that it is neither so small that the number of granules is too large to work with, nor so big that it becomes awkward to work with a single granule. Striking such a balance is occasionally not feasible (see for example MODIS 5 min products).
- Granule
- A distinct referenceable amount of data within a Data Set, based on its granularity. Often this is equivalent to the smallest amount of data that has been identified as being a unique item in a catalog or inventory of items for this data set.
- There are two definitions that might be identified with the term "Granule":
1) In the original use of the term "Granule" in NASA's EOSDIS metadata, the term "Granule" was usually identified as "the smallest orderable item in an archive's inventory". As I recall, the concern was that when data files were stored on magnetic tapes, the software could split the file into pieces, one of which would reside on one tape volume, while the second would reside on another. To keep the reference to the original file, the EOSDIS developers used the term "Granule".
2) A "Granule" might also refer to an OAIS RM "Dissemination Information Package" (DIP), which is a collection of information distributed in response to a user order. The OAIS RM identifies "Archive Information Units" (AIUs) that are the "atomic elements" of information in an archive. It seems reasonable to categorize the AIUs into different kinds of objects, such as Physical Objects, Digital Files, Relational Databases, or Job Residues. Assuming that the AIU objects are homogeneous, an archive can create a DIP made of heterogeneous objects, which is what users receive. In some cases, the DIPs may be preassembled. [BRB - 2/16/2011]
- Data Type
- Data that have been processed in the same way: by the same general algorithm, with the same granularity, with the same format, etc. [This concept has been formalized by the NASA EOS project as an "Earth Science Data Type" or ESDT.]
- Data Version
- A concept incorporating the version of the algorithm producing specific granules, but also the versions of the inputs to that algorithm. [This is roughly represented by the NASA EOS term "Collection".] When data are reprocessed, they get a new Data Version and therefore form a new Data Set.
- Data Set
- A collection of granules of the same Data Type and the same Data Version. [Comment: NSIDC would have problems with this definition, since our systems currently deal with data sets that are versioned (i.e., our data sets are your Data Types, and there are multiple versions of a particular data set).]
- Closed Data Set
- A Data Set that is no longer being updated.
- Open Data Set
- A Data Set that is currently being processed. Over time, it can be altered: current granules can be removed from it, or new granules can be added to it.
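The relationships among these definitions can be sketched as a small data model. This is only one reading of the definitions above, under the assumption that a Data Set is keyed by its (Data Type, Data Version) pair. The class and function names, the short name `EXAMPLE_L2`, and the granule identifiers are hypothetical and do not come from any EOS system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataSetKey:
    data_type: str      # same algorithm, granularity, and format (ESDT-like)
    data_version: str   # algorithm version plus versions of its inputs


@dataclass
class DataSet:
    key: DataSetKey
    granules: set = field(default_factory=set)
    closed: bool = False   # Open Data Sets may still change; Closed ones may not

    def add_granule(self, granule_id: str) -> None:
        if self.closed:
            raise RuntimeError("a Closed Data Set is no longer updated")
        self.granules.add(granule_id)

    def remove_granule(self, granule_id: str) -> None:
        if self.closed:
            raise RuntimeError("a Closed Data Set is no longer updated")
        self.granules.discard(granule_id)

    def close(self) -> None:
        self.closed = True


def reprocess(old: DataSet, new_version: str) -> DataSet:
    """Reprocessing gives a new Data Version, hence a new, initially empty Data Set."""
    return DataSet(DataSetKey(old.key.data_type, new_version))


# Usage with made-up names
v1 = DataSet(DataSetKey("EXAMPLE_L2", "001"))
v1.add_granule("granule-0001")
v1.close()
v2 = reprocess(v1, "002")
v2.add_granule("granule-0001-reprocessed")
print(v1.key != v2.key)   # the two versions are distinct Data Sets
```

Under this reading, an identifier assigned to an Open Data Set refers to contents that can still change, while an identifier assigned to a Closed Data Set refers to a fixed set of granules. The NSIDC comment above corresponds to keying a data set on `data_type` alone and treating versions as attributes of it.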
Use Cases
Please document your use case for a particular identifier scheme.