Difference between revisions of "Preservation Ontology"

From Earth Science Information Partners (ESIP)
Latest revision as of 05:36, June 29, 2011

About

Supporting the long-term preservation of Earth system science data and information is at the core of the Data Preservation and Stewardship Cluster's work. Doing so requires a formalism to codify preservation information. A forward-looking approach is to use semantic web technologies to capture the knowledge representation and to enable flexible use of this information.

Roadmap

We would like to start with practical use cases in preservation modeling, then work on a small and manageable model, and iterate on a working design. Ideally, the model should converge toward supporting the information identified in the Provenance and Context Content Standard.

Some major steps planned:

1. Define what we would like from a "Preservation Ontology"

2. Define practical Use Cases

3. Extract high-level requirements

4. Adopt/reuse existing provenance models

5. Extend model with more focus on Earth science preservation

  • Include provenance and context

6. Infuse model into data systems

  • some aspects covered by ACCESS project(s)?

7. Updates and Refinements

Approach

Following the steps from the roadmap, the approach will also include the following:

  • Follow closely the Provenance and Context Content Standard.
    • e.g. processing history, data formats used, product development history, algorithms, ATBDs, product tools, QA, validation, software.
    • On preservation, do we want to model with provenance and/or context?
    • Mark P on provenance vs context:
      • provenance is for reproducibility.
      • context is for someone who wants to use the information for something else.
    • Bruce B additional comments
      • context information is probably usefully divided into
        • documentation, which is text, images, and tables that provide information about the Earth science data
        • other data, which will usually be numbers or strings that might be needed to understand the Earth science data. For example, if one were dealing with a digital glacier photo, a file with the latitude and longitude of each pixel might help a user understand the image.
      • It is not clear what to do about data that do not necessarily appear in the published data files but are critical to the meaning of the Earth science data. Examples include calibration coefficients, radiative transfer parameters, or even supporting data, such as temperature and humidity profiles used as input (or, to back up one more step, the radiosonde or satellite data used to produce numerical weather forecasts).
      • The narrowest definition I can think of limits the items that need to be incorporated into objects referenced as production provenance to those "touched" by production scripts. This view assumes that production is done by discrete jobs that ingest or produce discrete digital files. Specifically, we would include only
        • Files ingested or produced
        • Source code identified uniquely as to version
        • Script or Control Flow
        • Job residues, meaning the record of running an individual job, with attributes or fields including
          • Start Date and Time
          • Wall Clock End Date and Time
          • Production Environment Configuration
      • This narrow definition does include ancillary files, such as those that contain calibration coefficients, radiative transfer model coefficients, and so on. It does not include any documentation or other material.
      • It may be worth separating the production provenance into routine production runs and runs done for validation, which may incorporate additional data sources. For example, if the main production were creating stratospheric ozone profiles, validation processing might include jobs that compare the routinely produced satellite data with in situ measurements from aircraft and balloons.
      • This narrowest definition does not include data used to derive ancillary file contents. An example is calibration measurements obtained before a satellite instrument launch and then processed into regression fits, whose coefficients the calibration algorithms use as gains. We need a discussion of how those kinds of ancillary data fit into our categorization.
      • This narrowest definition also does not include the human context of plans and procedures. For example, a Data Management Plan or a Submission Agreement provides guidance for production. At a highly detailed level, a project Work Breakdown Schedule might also be relevant. Perhaps these are context information -- or perhaps they would be included in a broader definition of "provenance".
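  • Start with the Open Provenance Model (http://openprovenance.org/).
    • The W3C Provenance Incubator Group (http://www.w3.org/2005/Incubator/prov/wiki/Main_Page) has a good comparison of different provenance models and their mappings in the Provenance Vocabulary Mappings (http://www.w3.org/2005/Incubator/prov/wiki/Provenance_Vocabulary_Mappings).
  • Explore whether some of the preservation model information can be mapped to ISO 19115 - Metadata for Geographic Data.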
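
As an illustration only, the narrow "touched by production scripts" definition above could be sketched as a small Python record. All class, field, and file names here are hypothetical and not part of any agreed model. The only borrowed vocabulary is the pair of Open Provenance Model relation names "used" and "wasGeneratedBy", which link a job to the files it ingested and produced.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass(frozen=True)
class JobResidue:
    """Record of one production job. Field names are hypothetical."""
    job_id: str
    start_time: datetime            # start date and time
    end_time: datetime              # wall clock end date and time
    environment: str                # production environment configuration
    source_code_version: str        # source code identified uniquely by version
    script: str                     # script or control flow that ran the job
    files_ingested: Tuple[str, ...] = ()
    files_produced: Tuple[str, ...] = ()


def to_opm_edges(job: JobResidue) -> List[Tuple[str, str, str]]:
    """Express a job as (subject, relation, object) edges using the
    OPM relation names 'used' and 'wasGeneratedBy'."""
    edges = [(job.job_id, "used", name) for name in job.files_ingested]
    edges += [(name, "wasGeneratedBy", job.job_id) for name in job.files_produced]
    return edges


# Hypothetical example: one job that reads radiances plus an ancillary
# calibration file and writes an ozone profile.
job = JobResidue(
    job_id="job-001",
    start_time=datetime(2011, 6, 1, 0, 0),
    end_time=datetime(2011, 6, 1, 0, 30),
    environment="example-production-config",
    source_code_version="v1.0",
    script="run_level2.sh",
    files_ingested=("radiances.h5", "calibration_coefficients.dat"),
    files_produced=("ozone_profile.h5",),
)
edges = to_opm_edges(job)
```

Note that the ancillary calibration file appears as an ingested file, consistent with the narrow definition, while documentation and plans have no place in this record.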

Use Cases

Here are some initial placeholder groupings:

Capturing preservation information

  • capturing provenance of data production runs.
  • capturing provenance of a climate analysis conducted by a scientist on a workstation.
  • capturing data product context.
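
A minimal sketch of the first use case, assuming a production job can be wrapped in a Python callable. The function name `run_with_provenance` and the dictionary keys are hypothetical. The sketch records start and end times, a description of the environment, and SHA-256 checksums of the files ingested and produced.

```python
import hashlib
import platform
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_with_provenance(job, inputs, outputs):
    """Run job() and return a record of what it touched."""
    # Checksum the inputs first, in case the job modifies them.
    ingested = {str(p): sha256_of(p) for p in inputs}
    start = datetime.now(timezone.utc)
    job()
    end = datetime.now(timezone.utc)
    return {
        "start_time": start.isoformat(),
        "end_time": end.isoformat(),
        "environment": {
            "python": platform.python_version(),
            "platform": platform.platform(),
        },
        "files_ingested": ingested,
        "files_produced": {str(p): sha256_of(p) for p in outputs},
    }


# Demonstration with throwaway files standing in for real granules.
with tempfile.TemporaryDirectory() as workdir:
    src = Path(workdir) / "input.txt"
    dst = Path(workdir) / "output.txt"
    src.write_text("1 2 3\n")
    record = run_with_provenance(
        job=lambda: dst.write_text(src.read_text().upper()),
        inputs=[src],
        outputs=[dst],
    )
```

A real production system would also need to record the source code version and the controlling script, which this sketch leaves out.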

Using preservation information

  • provenance for reproducibility
  • comparison of production runs from two granules.
  • context for reuse in other domains
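
To make the comparison use case concrete, here is a small sketch that assumes each production run is described by a flat dictionary of provenance fields. The field names and values are hypothetical. The function reports only the fields whose values differ between the two runs.

```python
def diff_runs(run_a, run_b):
    """Return {field: (value_a, value_b)} for every field that differs.

    A field present in only one run is reported with None for the other.
    """
    fields = sorted(set(run_a) | set(run_b))
    return {
        name: (run_a.get(name), run_b.get(name))
        for name in fields
        if run_a.get(name) != run_b.get(name)
    }


# Hypothetical provenance for the runs that produced two granules.
run_a = {
    "source_code_version": "v1.0",
    "calibration_file": "cal_2010.dat",
    "environment": "cluster-a",
}
run_b = {
    "source_code_version": "v1.1",
    "calibration_file": "cal_2010.dat",
    "environment": "cluster-a",
}

differences = diff_runs(run_a, run_b)
print(differences)  # {'source_code_version': ('v1.0', 'v1.1')}
```

In this example the two granules differ only in the version of the source code that produced them, which is the kind of answer a user comparing granules would want from preserved provenance.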

Model

tbd

Infusion

tbd

References