ISO Lineage

From Earth Science Information Partners (ESIP)
Revision as of 16:01, September 29, 2017 by Jkozimor (talk | contribs)

Tracking data sources and the processing applied to them is becoming increasingly important as scientists seek to define trends and unexpected changes in the environment. Keeping track of data transformations and processing, generally termed lineage, is an important role of high-quality metadata. The ISO metadata standard provides a simple lineage model based on sources which are either used or produced in a series of process steps. This model can be helpful in many cases despite its simplicity. Sources and process steps are linked together to describe the lineage of a resource.

The Model

This figure shows an overview of the ISO lineage model, which links sources to process steps. Sources can be input to, or output from (19115-2), a process step. Each process step has associated processing and algorithm information (also added in 19115-2). These improvements make it important to use 19115-2 if you need good lineage descriptions.

ISO 19115-2 Lineage UML. This figure shows more detail in the UML model used by the ISO Standard to describe lineage. In some cases, a simple descriptive statement can describe the lineage effectively. In more complex cases, multiple sources and process steps might be required. The definitions of sources and processSteps are also shown in the UML. The capabilities to specify the spatial and temporal extent of the source and to describe the rationale for a process step are new in the ISO Standard. Note that each source can have any number of associated sourceSteps and that each processStep can have any number of sources (and outputs in ISO 19115-2).
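
For the simplest case mentioned above, where a descriptive statement is enough, the lineage might be sketched as follows. This is a minimal illustration rather than a complete record: the statement text is a placeholder, and the gmd and gco namespace prefixes are assumed to be declared on an enclosing element.

```xml
<gmd:lineage>
  <gmd:LI_Lineage>
    <!-- Placeholder text: a single narrative statement, with no sources or process steps -->
    <gmd:statement>
      <gco:CharacterString>Brief Text</gco:CharacterString>
    </gmd:statement>
  </gmd:LI_Lineage>
</gmd:lineage>
```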

Sources and Steps

The original ISO 19115 Source descriptions (LI_Source) were extended in 19115-2 to include several more elements. The LE_Source includes the following elements:

LE_Source
+ description[0..1]: CharacterString
+ scaleDenominator[0..1]: MD_RepresentativeFraction
+ sourceReferenceSystem[0..1]: MD_ReferenceSystem
+ sourceCitation[0..1]: CI_Citation
+ sourceExtent[0..*]: EX_Extent
+ processedLevel[0..1]: MD_Identifier
+ resolution[0..1]: LE_NominalResolution

Process steps (LE_ProcessStep) and their supporting classes include:

LE_ProcessStep
+ description: CharacterString
+ rationale[0..1]: CharacterString
+ dateTime[0..1]: DateTime
+ processor[0..*] : CI_ResponsibleParty
+ processingInformation[0..*]: LE_Processing
+ report[0..*]: LE_ProcessStepReport

LE_Processing
+ identifier: MD_Identifier
+ softwareReference[0..*]: CI_Citation
+ procedureDescription[0..1]: CharacterString
+ documentation[0..*]: CI_Citation
+ runTimeParameters[0..1]: CharacterString
+ algorithm[0..*]: LE_Algorithm

LE_Algorithm
+ citation: CI_Citation
+ description: CharacterString

LE_ProcessStepReport
+ name: CharacterString
+ description[0..1]: CharacterString
+ fileType[0..1]: CharacterString

The ISO lineage model is simple, but it is probably sufficient for many common processing scenarios. In complex processing scenarios it may provide only summary information. In those cases, the CI_Citations in LE_Source, LE_Processing, and LE_Algorithm can point to resources that provide more detail when necessary.
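
To illustrate how citations can carry the detail, here is a sketch of a process step with processing, algorithm, and report information. All ids, references, and text values are hypothetical placeholders, and the citations are assumed to be defined elsewhere in the record. The element order follows the UML listings above and should be checked against the gmi schema before use.

```xml
<gmd:processStep>
  <gmi:LE_ProcessStep id="exampleStep">  <!-- hypothetical id -->
    <gmd:description>
      <gco:CharacterString>Brief Text</gco:CharacterString>
    </gmd:description>
    <gmd:rationale>
      <gco:CharacterString>Brief Text</gco:CharacterString>
    </gmd:rationale>
    <gmd:source xlink:href="#exampleSource"/>
    <gmi:output xlink:href="#exampleOutput"/>
    <gmi:processingInformation>
      <gmi:LE_Processing>
        <gmi:identifier>
          <gmd:MD_Identifier>
            <gmd:code>
              <gco:CharacterString>Processing identifier</gco:CharacterString>
            </gmd:code>
          </gmd:MD_Identifier>
        </gmi:identifier>
        <!-- Citations referenced by id; the targets are assumed to exist elsewhere -->
        <gmi:softwareReference xlink:href="#softwareCitation"/>
        <gmi:documentation xlink:href="#documentationCitation"/>
        <gmi:runTimeParameters>
          <gco:CharacterString>Brief Text</gco:CharacterString>
        </gmi:runTimeParameters>
        <gmi:algorithm>
          <gmi:LE_Algorithm>
            <gmi:citation xlink:href="#algorithmCitation"/>
            <gmi:description>
              <gco:CharacterString>Brief Text</gco:CharacterString>
            </gmi:description>
          </gmi:LE_Algorithm>
        </gmi:algorithm>
      </gmi:LE_Processing>
    </gmi:processingInformation>
    <gmi:report>
      <gmi:LE_ProcessStepReport>
        <gmi:name>
          <gco:CharacterString>Brief Text</gco:CharacterString>
        </gmi:name>
      </gmi:LE_ProcessStepReport>
    </gmi:report>
  </gmi:LE_ProcessStep>
</gmd:processStep>
```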

XML Implementation

Implementing these relationships in XML can seem daunting, but the XML representation handles them with ids and references. The LE_Source and LE_ProcessStep objects (the boxes in the UML) are implemented as independent children of the LI_Lineage object, each with a unique identifier. The relationships between the boxes (the source, output, and sourceStep roles) are implemented as references to those identifiers.

The example below shows the lineage section of a DART metadata record. This DART dataset is made up of data from three different deployments, each of which is listed as a source in the second part of the lineage section. Each source includes

   *an id (D165_1999, D165_2000, and D165_2001),
   *a spatial and temporal extent defined using a reference to a full description in a different part of the record (xlink:href="#Extent_D165_2001"), and
   *a sourceStep, which is also defined by a reference to a full definition located in the first part of the lineage section (e.g. xlink:href="#Received_D165_2001").

The processing of each source is described in the first part of the lineage section. In this case, the process is the receipt of the data by the archive. The processSteps include:

   *a brief description of the process,
   *when it was done,
   *who did it, defined by a reference to the seriesMetadataContact defined elsewhere in the record, and
   *a reference to the source that was processed (gmd:source xlink:href="#D165_1999").

Note the use of ids within this record to identify sources and process steps and to make links between them.

<gmd:dataQualityInfo>
   <gmd:DQ_DataQuality>
      <gmd:scope>
         <gmd:DQ_Scope id="datasetScope">
            <gmd:level>
               <gmd:MD_ScopeCode codeList="./resources/codeList.xml#MD_ScopeCode" codeListValue="dataset"/>
            </gmd:level>
            <gmd:extent xlink:href="#boundingExtent"/>
         </gmd:DQ_Scope>
      </gmd:scope>
      <gmd:lineage xlink:title="DART Buoy D165 Processing">
         <gmd:LI_Lineage uuid="95BD4CCC-D27D-8DE4-E040-0AC8C5BB43B64">
            <gmd:statement>
               <gco:CharacterString>DART Buoy D165 Processing</gco:CharacterString>
            </gmd:statement>
            <gmd:processStep>
               <gmd:LI_ProcessStep id="Received_D165_1999">
                  <gmd:description>
                     <gco:CharacterString>Received edited data D165_1999-ed</gco:CharacterString>
                  </gmd:description>
                  <gmd:dateTime>
                     <gco:DateTime>2005-09-02T00:00:00</gco:DateTime>
                  </gmd:dateTime>
                  <gmd:processor xlink:href="#seriesMetadataContact"/>
                  <gmd:source xlink:href="#D165_1999"/>
               </gmd:LI_ProcessStep>
            </gmd:processStep>
            <gmd:processStep>
               <gmd:LI_ProcessStep id="Received_D165_2000">
                  <gmd:description>
                     <gco:CharacterString>Received edited data D165_2000-ed</gco:CharacterString>
                  </gmd:description>
                  <gmd:dateTime>
                     <gco:DateTime>2005-09-02T00:00:00</gco:DateTime>
                  </gmd:dateTime>
                  <gmd:processor xlink:href="#seriesMetadataContact"/>
                  <gmd:source xlink:href="#D165_2000"/>
               </gmd:LI_ProcessStep>
            </gmd:processStep>
            <gmd:processStep>
               <gmd:LI_ProcessStep id="Received_D165_2001">
                  <gmd:description>
                     <gco:CharacterString>Received edited data D165_2001-ed</gco:CharacterString>
                  </gmd:description>
                  <gmd:dateTime>
                     <gco:DateTime>2005-09-02T00:00:00</gco:DateTime>
                  </gmd:dateTime>
                  <gmd:processor xlink:href="#seriesMetadataContact"/>
                  <gmd:source xlink:href="#D165_2001"/>
               </gmd:LI_ProcessStep>
            </gmd:processStep>
            <gmd:source>
               <gmd:LI_Source id="D165_1999">
                  <gmd:description>
                     <gco:CharacterString>gov.noaa.ngdc.dart:D165_1999</gco:CharacterString>
                  </gmd:description>
                  <gmd:sourceExtent xlink:href="#Extent_D165_1999"/>
                  <gmd:sourceStep xlink:href="#Received_D165_1999"/>
               </gmd:LI_Source>
            </gmd:source>
            <gmd:source>
               <gmd:LI_Source id="D165_2000">
                  <gmd:description>
                     <gco:CharacterString>gov.noaa.ngdc.dart:D165_2000</gco:CharacterString>
                  </gmd:description>
                  <gmd:sourceExtent xlink:href="#Extent_D165_2000"/>
                  <gmd:sourceStep xlink:href="#Received_D165_2000"/>
               </gmd:LI_Source>
            </gmd:source>
            <gmd:source>
               <gmd:LI_Source id="D165_2001">
                  <gmd:description>
                     <gco:CharacterString>gov.noaa.ngdc.dart:D165_2001</gco:CharacterString>
                  </gmd:description>
                  <gmd:sourceExtent xlink:href="#Extent_D165_2001"/>
                  <gmd:sourceStep xlink:href="#Received_D165_2001"/>
               </gmd:LI_Source>
            </gmd:source>
         </gmd:LI_Lineage>
      </gmd:lineage>
   </gmd:DQ_DataQuality>
</gmd:dataQualityInfo>
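
The references in this example only resolve if matching ids exist elsewhere in the record. As a rough sketch, the targets of xlink:href="#Extent_D165_2001" and xlink:href="#seriesMetadataContact" might look like the following. The parent elements, the organisation name, the role code, and the elided contents are assumptions made for illustration and are not taken from the DART record.

```xml
<!-- Hypothetical target of xlink:href="#Extent_D165_2001" -->
<gmd:extent>
  <gmd:EX_Extent id="Extent_D165_2001">
    <gmd:geographicElement>...</gmd:geographicElement>
    <gmd:temporalElement>...</gmd:temporalElement>
  </gmd:EX_Extent>
</gmd:extent>

<!-- Hypothetical target of xlink:href="#seriesMetadataContact" -->
<gmd:contact>
  <gmd:CI_ResponsibleParty id="seriesMetadataContact">
    <gmd:organisationName>
      <gco:CharacterString>Organisation name</gco:CharacterString>
    </gmd:organisationName>
    <gmd:role>
      <gmd:CI_RoleCode codeList="./resources/codeList.xml#CI_RoleCode" codeListValue="pointOfContact"/>
    </gmd:role>
  </gmd:CI_ResponsibleParty>
</gmd:contact>
```

Defining the extent and the contact once and referencing them by id keeps the lineage section short and avoids repeating the same description in several places.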

This XML shows parts of a lineage section for a CoastWatch Swath dataset.

<gmd:lineage>
  <gmd:LI_Lineage>
    <gmd:processStep>
      <gmd:LI_ProcessStep id="121">
        <gmd:description>
          <gco:CharacterString>
             * Ingest and calibrate: ingests raw satellite data to TeraScan data format.* Automatic navigation: corrects an ingested AVHRR pass file.
          </gco:CharacterString>
        </gmd:description>
        <gmd:dateTime gco:nilReason="Not complete"/>
        <gmd:processor>...</gmd:processor>
        <gmd:source xlink:href="#HRPT_AVHRR_L0"/>  <!-- 19115-2: input -->
        <gmd:source xlink:href="#HRPT_AVHRR_L1B"/> <!-- 19115-2: input -->
        <gmd:source xlink:href="#TDF_Temp"/>       <!-- 19115-2: output -->
      </gmd:LI_ProcessStep>
    </gmd:processStep>
    <gmd:processStep>
      <gmd:LI_ProcessStep id="122">
        ...
        <gmd:source xlink:href="#TDF_Temp"/>      <!-- 19115-2: input -->
        <gmd:source xlink:href="#SST_Cloud_TDF"/> <!-- 19115-2: output -->
      </gmd:LI_ProcessStep>
    </gmd:processStep>
    <gmd:processStep>...</gmd:processStep>
    <gmd:processStep>...</gmd:processStep>
    <gmd:source>
      <gmd:LI_Source id="HRPT_AVHRR_L1B">
        <gmd:description>
          <gco:CharacterString>
            HRPT is a live data feed as the spacecraft goes over a receiving station.
          </gco:CharacterString>
        </gmd:description>
        <gmd:sourceCitation></gmd:sourceCitation>
        <gmd:sourceExtent>
          <gmd:EX_Extent>
            <gmd:temporalElement>
              <gmd:EX_TemporalExtent>
                <gmd:extent>
                  <gml:TimePeriod gml:id="tp_1030059.81238">
                    <gml:beginPosition>2003-11-10</gml:beginPosition>
                    <gml:endPosition/>
                  </gml:TimePeriod>
                </gmd:extent>
              </gmd:EX_TemporalExtent>
            </gmd:temporalElement>
          </gmd:EX_Extent>
        </gmd:sourceExtent>
        <gmd:sourceStep xlink:href="#121"/>
      </gmd:LI_Source>
    </gmd:source>
    <gmd:source>
      <gmd:LI_Source id="HRPT_AVHRR_L0">
      ...
      </gmd:LI_Source>
    </gmd:source>
    <gmd:source>...</gmd:source>
    <gmd:source>...</gmd:source>
    <gmd:source>...</gmd:source>
    <gmd:source>...</gmd:source>
  </gmd:LI_Lineage>
</gmd:lineage>
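
The comments in the process steps above mark which gmd:source references would be inputs and which would be outputs under 19115-2. A sketch of the first step rewritten with the 19115-2 elements might look like this. The description is shortened, the elided processor is left as is, and the id step_121 is a hypothetical rename, since ids of type xs:ID cannot begin with a digit.

```xml
<gmd:processStep>
  <gmi:LE_ProcessStep id="step_121">  <!-- hypothetical id -->
    <gmd:description>
      <gco:CharacterString>Ingest and calibrate; automatic navigation.</gco:CharacterString>
    </gmd:description>
    <gmd:processor>...</gmd:processor>
    <!-- Inputs stay in gmd:source -->
    <gmd:source xlink:href="#HRPT_AVHRR_L0"/>
    <gmd:source xlink:href="#HRPT_AVHRR_L1B"/>
    <!-- The output moves to gmi:output, which was added in 19115-2 -->
    <gmi:output xlink:href="#TDF_Temp"/>
  </gmi:LE_ProcessStep>
</gmd:processStep>
```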

Lineage as a Component

The same inputs or processing chains are often used to create a series of results or files. In some cases, it may make sense to describe these elements once and access them as components. There are many options for the granularity of this approach, and the choice depends on how often the data sets and processing systems involved change. For example, satellite data processing involves at least four kinds of data in processing systems: other products, ancillary data, auxiliary data, and lookup tables. These inputs change on different schedules, e.g. input satellite products may change every cycle while global digital elevation models may never change. Repeating complete source information that rarely changes increases the size of the metadata records and can be distracting. Referencing these quasi-static sources instead of including them simplifies the metadata and focuses it on lineage items that change.

This example shows a schematic processStep that includes one of each source type (complete contents are only shown for the first source):

<gmd:lineage>
  <gmd:LI_Lineage>
    <gmd:processStep>
      <gmi:LE_ProcessStep>
        <gmd:description>Brief Text</gmd:description>
        <gmd:source>
          <gmi:LE_Source uuid="uniqueIdentifierForSource">
            <gmd:sourceCitation>
              <gmd:CI_Citation>
                <gmd:title>
                  <gmx:FileName src="InputFileURI">Unique product file name</gmx:FileName>
                </gmd:title>
                <gmd:date>
                  <gmd:CI_Date>
                    <gmd:date/>
                    <gmd:dateType>
                      <gmd:CI_DateTypeCode codeList="" codeListValue="creation">creation</gmd:CI_DateTypeCode>
                    </gmd:dateType>
                  </gmd:CI_Date>
                </gmd:date>
              </gmd:CI_Citation>
            </gmd:sourceCitation>
          </gmi:LE_Source>
        </gmd:source>
        <gmd:source uuid="uniqueIdentifierForSource">Ancillary Data</gmd:source>
        <gmd:source uuid="uniqueIdentifierForSource">Auxiliary Data</gmd:source>
        <gmd:source uuid="uniqueIdentifierForSource">Lookup Table</gmd:source>
        <gmi:processingInformation>
          <gmi:algorithm>
            <gmi:description>Brief Text</gmi:description>
            <gmi:citation>Citation</gmi:citation>
          </gmi:algorithm>
          <gmi:softwareReference>Citation</gmi:softwareReference>
          <gmi:documentation>Citation</gmi:documentation>
        </gmi:processingInformation>
        <gmi:output uuid="uniqueIdentifierForSource">Output Product</gmi:output>
      </gmi:LE_ProcessStep>
    </gmd:processStep>
  </gmd:LI_Lineage>
</gmd:lineage>

The earlier examples show how internal references can be used to simplify the XML representation of the lineage. If the inputs to the process step are uniquely identified, and a service exists that can provide the XML for a particular component, the sources can be referenced as external components to make a more compact lineage description:

<gmd:lineage>
  <gmd:LI_Lineage>
    <gmd:processStep>
      <gmi:LE_ProcessStep>
        <gmd:description>Brief Text</gmd:description>
        <gmd:source xlink:href="http://service/uniqueIdentifier" xlink:title="inputProduct"/>
        <gmd:source xlink:href="http://service/uniqueIdentifier" xlink:title="inputAncillaryData"/>
        <gmd:source xlink:href="http://service/uniqueIdentifier" xlink:title="inputAuxiliaryData"/>
        <gmd:source xlink:href="http://service/uniqueIdentifier" xlink:title="inputLookupTable"/>
        <gmi:processingInformation xlink:href="http://componentService/uniqueIdentifier" xlink:title="Standard Processing Information"/>
        <gmi:output xlink:href="http://service/uniqueIdentifier" xlink:title="outputProduct"/>
      </gmi:LE_ProcessStep>
    </gmd:processStep>
  </gmd:LI_Lineage>
</gmd:lineage>
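
As an illustration of what such a component service might return for one of the source references above, here is a minimal sketch of a standalone source component. The response shape and the placeholder values are assumptions, since no actual service is described here. A standalone component has to declare its own namespaces, shown with the ISO 19139 namespace URIs.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical response for http://service/uniqueIdentifier -->
<gmi:LE_Source xmlns:gmi="http://www.isotc211.org/2005/gmi"
               xmlns:gmd="http://www.isotc211.org/2005/gmd"
               xmlns:gco="http://www.isotc211.org/2005/gco"
               uuid="uniqueIdentifier">
  <gmd:description>
    <gco:CharacterString>Lookup Table</gco:CharacterString>
  </gmd:description>
  <gmd:sourceCitation>
    <gmd:CI_Citation>
      <gmd:title>
        <gco:CharacterString>Brief Text</gco:CharacterString>
      </gmd:title>
      <!-- Date left empty with a nil reason in this placeholder -->
      <gmd:date gco:nilReason="unknown"/>
    </gmd:CI_Citation>
  </gmd:sourceCitation>
</gmi:LE_Source>
```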