NetCDF, HDF, and ISO Metadata

From Earth Science Information Partners (ESIP)
Revision as of 07:44, July 10, 2013 by Ted.Habermann (talk | contribs) (→‎Identifiers And References)

NetCDF data files can contain metadata for the entire file (global attributes) or for particular variables (variable attributes). This information is structured as collections of parameter/value pairs. NcML provides a representation of this information in simple attributes:

<attribute name="attribute_name" value="attribute_value" type="attribute_type"/>

The authors of the netCDF files are free to choose any attribute names. Several conventions exist to help facilitate interoperability by prescribing specific attribute names (e.g. the NetCDF Attribute Convention for Dataset Discovery, and the NetCDF Climate-Forecast Metadata Conventions).

The combination of conventions and this simple structure serves quite well in a number of situations. However, as the relationship between netCDF and ISO metadata (mostly ISO 19139) has been explored and developed (see ncISO), a number of documentation requirements have emerged that are not easily addressed in this structure. Many of the limitations of the netCDF structure stem from the fact that attribute and group names must be unique within a particular context (global or variable). This limitation poses a significant obstacle to many documentation needs. It can lead to attribute names that carry information that is more appropriate as a property of an object than as part of the name. For example, the creator_name attribute may be more natural as a name attribute in a person object that has a role of creator.

We have explored two possible approaches for adding more structure to metadata in netCDF and HDF files:

  1. netCDF4 and HDF5 introduce the idea of User Defined Types (UDT) which are essentially structures that can be defined by users to address specific needs. These types can be used for variables or attributes and can exist in arrays.
  2. netCDF4 and HDF5 also allow grouping of attributes and groups. This is the groups-of-groups (GoG) approach.

In order to use either of these approaches to add structure to the metadata, the community must identify and agree on types that make sense for ISO-compliant attributes. Several proposals are outlined on this page using the GoG approach.

Identifiers And References

The focus of datafiles is on the data they contain. The metadata in the datafile usually becomes interesting when there is an unexpected observation or some other problem. For example, assume that an unexpected change is identified in an on-going product that is produced regularly from various input sources and processing steps. An obvious question is "did the sources or the processing change?". The information required to answer this question could be voluminous and difficult to capture in a granule. This can be a good time to use references to metadata components rather than complete documentation. In this case unique identifiers could be used for sources and processing steps that would not change unless the source or processing changed. Data analysis tools could compare these identifiers across the unexpected change in the data. If they don't change, this type of change could be eliminated as a source of the change in the data. If a change in identifiers occurred, the user could follow the references to understand the details. This approach minimizes extraneous and redundant information while providing a link to that information when it is needed.

In order to support this capability, the groups and UDTs proposed here all include a UUID field that is expected to hold an optional unique identifier for the component. When the granule metadata is transformed into ISO this identifier can be used to construct the correct internal or external reference.
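
The identifier comparison described above can be sketched in a few lines of Python. This is only an illustration: it assumes each granule's metadata has already been reduced to a mapping from component name to UUID, and the function name changed_components and the sample identifiers are hypothetical.

```python
def changed_components(before, after):
    """Return the names of components whose identifier differs between two granules.

    `before` and `after` map a component name (a source or processing step)
    to the UUID recorded in each granule. This layout is an assumption made
    for illustration; it is not defined by the proposal.
    """
    names = set(before) | set(after)
    # A component counts as changed if its UUID differs or it is missing on one side.
    return sorted(n for n in names if before.get(n) != after.get(n))


# Hypothetical identifiers for two granules on either side of an unexpected change.
granule_a = {"source_1": "uuid-aaa", "processStep_1": "uuid-bbb"}
granule_b = {"source_1": "uuid-aaa", "processStep_1": "uuid-ccc"}

print(changed_components(granule_a, granule_b))
```

An empty result would rule out a change in sources or processing; a non-empty result tells the user which references to follow.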

Object Types

Each proposed object includes an attribute called objectType that gives the type of the object. These types are defined in the ISO metadata model.

Online Resources

Online Resource

All data users benefit from connections between metadata records and external Web resources. The critical information in such links is the URL for the resource. Providing bare URLs worked in the past (to some extent) when URLs and the content they pointed to were simple and relatively consistent. Today more complex links to a wide variety of resource types are ubiquitous. Users need information along with the URL to understand the purpose of the link and what might happen when they click on it.

NetCDF with the Attribute Convention for Data Discovery includes two URLs: creator_url and publisher_url. These are strings that give URLs but no explanatory information, which makes it difficult for users to understand and use these resources. The ISO approach to online resources includes a name and description for the link, which allows users to understand what the link does. The standard also includes a function for the online resource that allows grouping of multiple resources with similar functions.

The CI_OnlineResource object is one of the most straightforward and useful objects in the ISO Standards. As the Figure shows, it is made up of six strings. An alternative is to write the same information in NcML in a CI_OnlineResource group. This approach includes information to help users and applications understand the purpose of the link and allows straightforward validation and translation into ISO-compliant XML:

<group name="onlineResource">
    <attribute name="uuid" value="UUID"/>
    <attribute name="objectType" value="acdd:CI_OnlineResource"/>
    <attribute name="linkage" value="http://earthdata.nasa.gov/"/>
    <attribute name="protocol" value="http"/>
    <attribute name="applicationProfile" value="Web Browser"/>
    <attribute name="name" value="EOSDIS - Earth Data Website"/>
    <attribute name="description" value="Access to data and information on the NASA Earth Observing System"/>
    <attribute name="function" value="information"/>
</group>
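
As a rough sketch of how an application might read such a group, the Python below uses the standard library's xml.etree.ElementTree to turn a group (and any nested groups) into a dictionary. The helper name group_to_dict is hypothetical, and the sketch ignores the XML namespace that a real NcML document typically declares.

```python
import xml.etree.ElementTree as ET

NCML = """
<group name="onlineResource">
    <attribute name="uuid" value="UUID"/>
    <attribute name="objectType" value="acdd:CI_OnlineResource"/>
    <attribute name="linkage" value="http://earthdata.nasa.gov/"/>
    <attribute name="name" value="EOSDIS - Earth Data Website"/>
    <attribute name="function" value="information"/>
</group>
"""


def group_to_dict(group):
    """Convert a <group> element into a nested dict of attributes and subgroups."""
    result = {a.get("name"): a.get("value") for a in group.findall("attribute")}
    for sub in group.findall("group"):
        # Nested groups become nested dicts keyed by the group name.
        result[sub.get("name")] = group_to_dict(sub)
    return result


resource = group_to_dict(ET.fromstring(NCML))
print(resource["name"], "->", resource["linkage"])
```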

Dates

Dates are critical elements of many kinds of metadata and they can have many roles. For example, a cited resource can have a creation date, a publication date, and a revision date. NetCDF with the Attribute Convention for Data Discovery includes three types of dates: date_created, date_issued, and date_modified, in which the type of the date is part of the attribute name. This means a separate standard name is required for each type of date.

ISO dates include an ISO 8601 string that can describe a single date, a date and time, a range of dates, or a time period, as well as a type for the date (see ISO 19115 and 19115-2 CodeList Dictionaries for values). An ISO-compliant date group looks like:

<group name="date">
   <attribute name="date" value="2012-02-29"/>
   <attribute name="dateType" value="creation"/>
</group>
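
The following Python sketch shows one way the three flat date attributes could be rewritten as date/dateType pairs. The mapping from attribute names to the ISO date types creation, publication, and revision is an assumption made here for illustration, and the function name date_groups and the sample values are hypothetical.

```python
# Assumed mapping from ACDD attribute names to ISO date types.
ACDD_DATE_TYPES = {
    "date_created": "creation",
    "date_issued": "publication",
    "date_modified": "revision",
}


def date_groups(global_attrs):
    """Build one {date, dateType} record per ACDD date attribute that is present."""
    return [
        {"date": global_attrs[name], "dateType": date_type}
        for name, date_type in ACDD_DATE_TYPES.items()
        if name in global_attrs
    ]


# Hypothetical global attributes from a netCDF file.
attrs = {"date_created": "2012-02-29", "date_modified": "2013-07-10", "title": "Example"}
for group in date_groups(attrs):
    print(group)
```

With this structure a new kind of date needs only a new dateType value, not a new attribute name.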

Citations

The CF conventions include a global attribute called references that is defined as "Published or web-based references that describe the data or methods used to produce it." The ISO Standards include a more complete CI_Citation structure that is used throughout the standards. Citations are clearly an important part of the metadata. ISO citations include two required attributes, title and date, and the date can have several types (see codeLists). The current proposal for a citation group is (the values of the attributes describe the equivalent ISO content):

<group name="citation">
    <attribute name="uuid" value="UUID"/>
    <attribute name="objectType" value="acdd:CI_Citation"/>
    <attribute name="title" value="Title of cited item"/>
    <attribute name="identifier" value="Identifier of cited item"/>
    <attribute name="edition" value="Edition of cited item"/>
    <group name="date">
        <attribute name="date" value="Date associated with cited item"/>
        <attribute name="dateType" value="Type of date associated with cited item"/>
    </group>
    <group name="citedResponsibleParty">
        <attribute name="uuid" value="UUID"/>
        <attribute name="objectType" value="acdd:CI_ResponsibleParty"/>
        <attribute name="individualName" value="Name of responsible individual"/>
        <attribute name="organisationName" value="Name of responsible organisation"/>
        <attribute name="electronicMailAddress" value="Email of responsible party"/>
        <group name="onlineResource">
            <attribute name="uuid" value="UUID"/>
            <attribute name="objectType" value="acdd:CI_OnlineResource"/>
            <attribute name="linkage" value="http://earthdata.nasa.gov/"/>
            <attribute name="protocol" value="http"/>
            <attribute name="applicationProfile" value="Web Browser"/>
            <attribute name="name" value="EOSDIS - Earth Data Website"/>
            <attribute name="description" value="Access to data and information on the NASA Earth Observing System"/>
            <attribute name="function" value="information"/>
        </group>
    </group>
</group>

Note that this example includes only one date and only one citedResponsibleParty. Suffixes (e.g. _1) on the end of these names can be used to ensure uniqueness of the group names. In cases with more than one date or citedResponsibleParty, the suffixes _2, _3, ..., _N would be used.
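
The suffix scheme can be illustrated with a short Python sketch that builds several date groups inside a citation using xml.etree.ElementTree. The function name add_date_groups and the sample dates are hypothetical; only the _1, _2, ... naming comes from the text above.

```python
import xml.etree.ElementTree as ET


def add_date_groups(citation, dates):
    """Append one <group> per date, named date_1, date_2, ... to keep names unique."""
    for i, (date, date_type) in enumerate(dates, start=1):
        group = ET.SubElement(citation, "group", name=f"date_{i}")
        ET.SubElement(group, "attribute", name="date", value=date)
        ET.SubElement(group, "attribute", name="dateType", value=date_type)
    return citation


citation = ET.Element("group", name="citation")
add_date_groups(citation, [("2011-06-06", "creation"), ("2012-02-29", "revision")])
print([g.get("name") for g in citation.findall("group")])
```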

People

NetCDF with the Attribute Convention for Data Discovery includes three types of people: creators, publishers, and contributors with the following properties:

Type         Name  URL  email  role
Creator      X     X    X
Publisher    X     X    X
Contributor  X                 X

The situation with people is a good example of the problem of attribute name "overload". The roles of the people are included in the names of the attributes and the attribute prefixes provide a de facto grouping mechanism, i.e. all attributes that start with creator are related to the same person (actually the attribute institution is an exception to this "rule". It is the institution of the creator, but it is not named creator_institution). If we wanted to document a contact for the metadata, it would require at least three new attributes: metadata_contact_name, _url, and _email, and perhaps _institution. The same is true for any other roles that might require identification.
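
The de facto grouping by prefix can be made explicit with a small Python sketch. It assumes the global attributes are available as a plain dictionary; the function name people_from_flat and the sample values are hypothetical. Note that institution is not picked up, which mirrors the exception mentioned above.

```python
ROLE_PREFIXES = ("creator", "publisher", "contributor")


def people_from_flat(attrs):
    """Collect role-prefixed attributes (e.g. creator_name) into one record per prefix."""
    people = {}
    for key, value in attrs.items():
        prefix, _, field = key.partition("_")
        if prefix in ROLE_PREFIXES and field:
            # The prefix is kept as "type" so it cannot clash with contributor_role.
            people.setdefault(prefix, {"type": prefix})[field] = value
    return people


# Hypothetical global attributes.
flat = {
    "creator_name": "A. Creator",
    "creator_email": "creator@example.org",
    "contributor_name": "A. Contributor",
    "contributor_role": "data provider",
    "institution": "Example Institute",
}
people = people_from_flat(flat)
print(people["creator"])
print(people["contributor"])
```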

CI_ResponsibleParty

The task of identifying all people that are associated with a dataset in this structure is very difficult. In fact, there is no approach that is guaranteed to work. There is also no guarantee of consistency in the descriptions of the people that are identified; they can have different information (e.g. there is currently no way to associate a URL or email with a contributor).

In the ISO Standard, people and organizations are described using the CI_ResponsibleParty class that includes names, positions, a variety of contact information, and a role for the person or organization. This allows consistent descriptions of responsible parties in any role included in the codelist shown in the Figure (which can be extended).

This example shows a possible CI_ResponsibleParty group identified as a contact:

<group name="contact">
    <attribute name="uuid" value="UUID"/>
    <attribute name="objectType" value="acdd:CI_ResponsibleParty"/>
    <attribute name="role" value="pointOfContact"/>
    <attribute name="individualName" value="Ted Habermann"/>
    <attribute name="organisationName" value="The HDF Group"/>
    <attribute name="electronicMailAddress" value="thabermann@hdfgroup.org"/>
    <group name="onlineResource">
        <attribute name="uuid" value="UUID"/>
        <attribute name="objectType" value="acdd:CI_OnlineResource"/>
        <attribute name="linkage" value="http://www.hdfgroup.org"/>
        <attribute name="name" value="HDF Home Page"/>
        <attribute name="description" value="The HDF Group provides a unique suite of technologies and supporting services that make possible the management of large and complex data collections. Its mission is to advance and support HDF (Hierarchical Data Format) technologies and ensure long-term access to HDF data."/>
        <attribute name="function" value="information"/>
    </group>
</group>

Keywords

[Figure: MD_Keywords]

NetCDF with the Attribute Convention for Data Discovery includes a keywords attribute which contains a comma-separated list of keywords or phrases. There is also a keywords_vocabulary attribute that gives the name of the keyword vocabulary. This works well if only one keyword vocabulary is being used, but this is unusual. The NASA Global Change Master Directory includes keywords in ten categories, all of which can be useful in many documentation situations. The netCDF approach to citing these keywords makes it difficult to use more than one.

The ISO approach to keywords (similar to most others) allows different types of keywords (see codeLists) as well as a citation to the source for the keywords. In this example the keyword attribute includes a comma-separated list of keywords with the same type and thesaurusName.

<group name="descriptiveKeywords">
    <attribute name="uuid" value="UUID"/>
    <attribute name="objectType" value="acdd:MD_Keywords"/>
    <attribute name="type" value="theme"/>
    <attribute name="keyword" value="Spectral/Engineering > Sensor Characteristics, Spectral/Engineering"/>
    <group name="thesaurusName">
        <attribute name="uuid" value="UUID"/>
        <attribute name="objectType" value="acdd:CI_Citation"/>
        <attribute name="title" value="NASA Global Change Master Directory (GCMD) Earth Science Keywords, Version 6.0"/>
        <group name="date">
            <attribute name="objectType" value="acdd:CI_Date"/>
            <attribute name="date" value="2011-06-06"/>
            <attribute name="dateType" value="revision"/>
        </group>
    </group>
</group>

Note that this example includes thesaurusName which is a citation group as described above.
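
A reader of such a group has to split the keyword attribute back into individual keywords. A minimal Python sketch, assuming that no keyword itself contains a comma (the function name split_keywords is hypothetical):

```python
def split_keywords(value):
    """Split a comma-separated keyword attribute into a list of trimmed keywords."""
    # Assumes commas only ever act as separators; a keyword containing a comma would be split.
    return [k.strip() for k in value.split(",") if k.strip()]


keywords = split_keywords(
    "Spectral/Engineering > Sensor Characteristics, Spectral/Engineering"
)
for keyword in keywords:
    print(keyword)
```

Because the type and thesaurusName live on the group rather than in the attribute name, keywords from a second vocabulary can go in a second descriptiveKeywords group without any new attribute names.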

Extents

The ISO Standards combine temporal and spatial extents into one object that is available to describe extents of datasets, quality reports, responsibilities, sources, and several other items. An EX_Extent group for a global dataset covering 2000 through 2009 looks like:

<group name="extent">
    <attribute name="uuid" value="UUID"/>
    <attribute name="objectType" value="acdd:EX_Extent"/>
    <attribute name="description" value="This is a global dataset"/>
    <attribute name="geographicIdentifier" value="world"/>
    <attribute name="westBoundLongitude" value="-180.0" type="float"/>
    <attribute name="eastBoundLongitude" value="180.0" type="float"/>
    <attribute name="southBoundLatitude" value="-90.0" type="float"/>
    <attribute name="northBoundLatitude" value="90.0" type="float"/>
    <attribute name="beginPosition" value="2000-01-01"/>
    <attribute name="endPosition" value="2009-12-31"/>
</group>
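
A group like this lends itself to simple consistency checks before translation to ISO. The Python sketch below is one possible set of checks, not part of the proposal. It uses the ISO names for the latitude bounds (southBoundLatitude and northBoundLatitude), assumes the values have already been read into a dictionary with numeric bounds, and handles only plain dates, not times or periods. The function name extent_problems is hypothetical.

```python
from datetime import date


def extent_problems(extent):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for key in ("westBoundLongitude", "eastBoundLongitude"):
        if not -180.0 <= extent[key] <= 180.0:
            problems.append(f"{key} outside [-180, 180]")
    for key in ("southBoundLatitude", "northBoundLatitude"):
        if not -90.0 <= extent[key] <= 90.0:
            problems.append(f"{key} outside [-90, 90]")
    if extent["southBoundLatitude"] > extent["northBoundLatitude"]:
        problems.append("southBoundLatitude exceeds northBoundLatitude")
    # West may exceed east for boxes that cross the antimeridian, so that is not checked.
    if date.fromisoformat(extent["beginPosition"]) > date.fromisoformat(extent["endPosition"]):
        problems.append("beginPosition is later than endPosition")
    return problems


global_extent = {
    "westBoundLongitude": -180.0, "eastBoundLongitude": 180.0,
    "southBoundLatitude": -90.0, "northBoundLatitude": 90.0,
    "beginPosition": "2000-01-01", "endPosition": "2009-12-31",
}
print(extent_problems(global_extent))
```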

Lineage

ISO 19115-2 Lineage UML

The CF Conventions include a 'history' attribute which provides an audit trail for modifications to the original data. The concept is that well-behaved generic netCDF filters will automatically append their name and the parameters with which they were invoked to the global history attribute of an input netCDF file. The conventions recommend that each line begin with a timestamp indicating the date and time of day that the program was executed. The CF Conventions also include a 'source' attribute which includes the name and version of the model used to create the file or a type of instrument used to make the observations.

The netCDF Lineage Model includes timestamps, algorithm names and command line arguments, and, possibly, instrument types. This may be adequate for describing the lineage of simple model results or observational files, but many recent datasets have complex lineage that includes various versions of multiple sources, data that may come from dynamic web services, and various versions of multiple processes and algorithms applied in sequence by scientific workflows. The ISO Lineage model provides a significant step towards lineage descriptions that include many of these items.

The UML diagram on the right illustrates the main objects in the ISO Lineage Model: Sources and Process Steps. Each of these is described by a collection of properties and they are connected by the source, output, and sourceStep roles in the middle of the diagram. In practice these connecting roles are implemented as identifier references to sources and processSteps that are defined elsewhere in the metadata record. This practice needs to be considered in the implementation of sources and processSteps in NcML.

We also need to consider the granularity of the lineage description. If the entire process algorithm is being considered or described as a single step, all of the sources are associated with this single step and the process of describing connections between sources and specific processSteps is simplified considerably. In that case, the focus is on unambiguous identification of the sources.


Sources

As indicated in the UML diagram, an ISO source includes a description, a citation, and references to the processStep(s) in which it is used. In this example those references are shown as a comma-separated set of UUIDs:

<group name="source">
    <attribute name="uuid" value="UUID"/>
    <attribute name="objectType" value="acdd:LE_Source"/>
    <attribute name="description" value="A source in the processing of this dataset"/>
    <group name="sourceCitation">
        <attribute name="uuid" value="UUID"/>
        <attribute name="objectType" value="acdd:CI_Citation"/>
        <attribute name="title" value="The title of a source in the processing of this dataset"/>
        <attribute name="identifier" value="A unique identifier or filename for the source"/>
        <group name="date">
            <attribute name="objectType" value="acdd:CI_Date"/>
            <attribute name="date" value="A date associated with the source"/>
            <attribute name="dateType" value="a type for the date e.g. publication"/>
        </group>
        <group name="citedResponsibleParty">
            <attribute name="uuid" value="UUID"/>
            <attribute name="objectType" value="acdd:CI_ResponsibleParty"/>
            <attribute name="role" value="The role of this responsibleParty e.g. originator"/>
            <attribute name="organisationName" value="The Name of the responsible organization"/>
        </group>
    </group>
    <attribute name="sourceStep" value="UUID,UUID,UUID"/>
</group>

Process Steps

ProcessSteps can be more complicated than sources if they include processing or algorithm information (see UML). Implementing them using the group constructs described here is rather straightforward. Once again, the references to the source(s) and output(s) are implemented as comma-separated lists of UUIDs:

<group name="processStep">
    <attribute name="uuid" value="UUID"/>
    <attribute name="objectType" value="acdd:LE_ProcessStep"/>
    <attribute name="description" value="A description of the processing step"/>
    <attribute name="dateTime" value=""/>
    <group name="processor">
        <attribute name="uuid" value="UUID"/>
        <attribute name="objectType" value="acdd:CI_ResponsibleParty"/>
        <attribute name="role" value="processor"/>
        <attribute name="individualName" value="The name of a processing person"/>
        <attribute name="organisationName" value="The name of a processing organization"/>
        <attribute name="positionName" value="The name of a processing position"/>
        <attribute name="electronicMailAddress" value="The email address of the processor"/>
    </group>
    <attribute name="source" value="UUID,UUID,UUID"/>
    <group name="processingInformation">
        <attribute name="uuid" value="UUID"/>
        <attribute name="objectType" value="acdd:LE_Processing"/>
        <attribute name="identifier" value="A unique identifier for the processing"/>
        <group name="algorithm">
            <attribute name="description" value="A brief description of the algorithm used in this step"/>
            <group name="citation">
                <attribute name="uuid" value="UUID"/>
                <attribute name="objectType" value="acdd:CI_Citation"/>
                <attribute name="title" value="The title of the algorithm document"/>
                <attribute name="identifier" value="A unique identifier for the algorithm"/>
                <group name="date_1">
                    <attribute name="objectType" value="acdd:CI_Date"/>
                    <attribute name="date" value="A date associated with the algorithm"/>
                    <attribute name="dateType" value="The type of the date"/>
                </group>
                <group name="citedResponsibleParty">
                    <attribute name="uuid" value="UUID"/>
                    <attribute name="objectType" value="acdd:CI_ResponsibleParty"/>
                    <attribute name="role" value="The role of a responsible party"/>
                    <attribute name="individualName" value="The name of a responsible party"/>
                </group>
            </group>
        </group>
    </group>
    <attribute name="output" value="UUID,UUID,UUID"/>
</group>
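
Because sources and process steps point at each other through comma-separated UUID lists, a tool that reads these groups will probably want to check that every reference resolves. The Python sketch below is one way to do that. It assumes the groups have already been loaded into dictionaries keyed by UUID; the function names and sample identifiers are hypothetical. The output attribute could be checked in the same way as source.

```python
def split_refs(value):
    """Split a comma-separated reference attribute into a list of identifiers."""
    return [ref.strip() for ref in value.split(",") if ref.strip()]


def dangling_refs(sources, steps):
    """Return (owner, reference) pairs whose reference matches no known component."""
    dangling = []
    for uuid, source in sources.items():
        # Each sourceStep entry should name a known process step.
        for ref in split_refs(source.get("sourceStep", "")):
            if ref not in steps:
                dangling.append((uuid, ref))
    for uuid, step in steps.items():
        # Each source entry should name a known source.
        for ref in split_refs(step.get("source", "")):
            if ref not in sources:
                dangling.append((uuid, ref))
    return dangling


# Hypothetical identifiers standing in for real UUIDs.
sources = {
    "src-1": {"sourceStep": "step-1"},
    "src-2": {"sourceStep": "step-1, step-9"},
}
steps = {"step-1": {"source": "src-1,src-2"}}

print(dangling_refs(sources, steps))
```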