Difference between revisions of "Interagency Data Stewardship/Citations/provider guidelines"

Revision as of 19:16, August 9, 2011

Back to Preservation and Stewardship / Citations

Introduction and Summary

Data citation is an evolving but increasingly important scientific practice. We see several important purposes of data citation:

To aid scientific reproducibility through direct, unambiguous reference to the precise data used in a particular study. (This is the paramount purpose and also the hardest to achieve).
To provide fair credit for data creators or authors, data stewards, and other critical people in the data production and curation process.
To ensure scientific transparency and reasonable accountability for authors and stewards.
To aid in tracking the impact of data set through reference in scientific literature.
To help data authors verify how their data are being used.

Nevertheless, data are rarely cited formally in practice. There are many reasons for this discrepancy, but part of the problem is that there is a lack of consistent recommendations on how to cite and reference a data set and on how to actually construct a proper data citation. We need consistent guidelines on how to create data citations for Earth science data. Then data stewards need to work closely with data providers and science teams to define citation content and provide clear instructions to users on how data sets should be cited.

Current recommendations by ESIP members for citing their data range from casual acknowledgement within the text of a paper to formal and specific citations within the references section of the paper with unique and persistent digital identifiers. At the same time, various international organizations have been working to develop formal guidelines for data citation. Examples of these include:

International Polar Year How to Cite a Data Set
DataCite—a consortium of libraries and related organizations working to define a citation approach around DOIs. They have defined a DataCite Metadata Scheme for the Publication and Citation of Research Data, Version 2.0, January 2011 describing myriad elements that could potentially be included in a citation.
A new CODATA Task Group on Data Citation Standards and Practices in collaboration with the International Council for Scientific and Technical Information.
DataVerse Network Project—an approach from the social science community using a Handle locator and “Universal Numerical Fingerprint” as a unique identifier.
GEOSS Science and Technology Committee is also working on guidelines by building from the IPY guidelines. See their draft

The ESIP Preservation and Stewardship cluster has examined these and other current approaches and has found that they are generally compatible and useful, but they do not entirely meet all the purposes of Earth science data citation. Indeed, we believe it is currently impossible to fully satisfy the requirement of scientific reproducibility in all situations. We do believe, however, that applying a reasonably rigorous approach, coupled with good version tracking, comprehensive documentation, and due diligence on the part of data stewards, can provide a useful and precise citation for the great majority of Earth science data most of the time.

In general, data sets should be cited like books. (Used here is the author-date style described in "Chicago Manual of Style, 15th Edition".) When users cite data, they need to use the style dictated by their publishers, but by providing an example, data stewards can give users all the important elements that should be included in their citations of data sets. Data stewards need to work closely with data providers and science teams to develop the actual content of the citation to ensure that each data citation unambiguously refers to the data that was utilized for a particular work.

The core required elements of a citation are

Author(s)--the people or organizations responsible for the intellectual work to develop the data set. The data creators.
Release Date--when the particular version of the data set was first made available for use (and potential citation) by others.
Title--the formal title of the data set
Version--the precise version of the data used. Careful version tracking is critical to accurate citation.
Archive and/or Distributor--the organization distributing or caring for the data, ideally over the long term.
Locator/Identifier--this could be a URL but ideally it should be a persistant service that resolves to the current location of the data in question. The Digital Object Identifier is currently the most broadly adopted service for persistently identifying and locating whole data collections (as opposed to individual files or granules), although other identifier/locator services, such as ARKs and Handles, could be used.
Access Date and Time--because data can be dynamic and changeable in ways that are not always reflected in release dates and versions, it is important to indicate when on-line data were accessed.

Additional fields can be added as necessary to credit other people and institutions, etc. Additionally, it is important to provide a scheme for users to indicate the precise subset of data that were used. This could be the temporal and spatial range of the data, the types of files used, a specific query id, or other ways of describing how the data were subsetted.

An example citation:

Zwally, H.J., R. Schutz, C. Bentley, J. Bufton, T. Herring, J. Minster, J. Spinhirne, and R. Thomas. 2003. GLAS/ICESat L1A Global Altimetry Data V018, 15 October to 18 November 2003. National Snow and Ice Data Center. Data set accessed 2011-07-21 at doi:10.3334/NSIDC/gla01.

Detailed Citation Content

These citation guidelines help data stewards define and maintain precise, persistent citations for data they manage. The guidelines build from the IPY Guidelines and are compatible with the DataCite Metadata Scheme for the Publication and Citation of Research Data, Version 2.2, July 2011. The Cluster will continue to work with the CODATA Task Group, the GEOSS STC, DataCite, and others to ensure that our guidelines remain compatible with broader community practice. The approach defined here could be viewed as an extension or detailed profile of existing approaches.

The citation should include the following elements as appropriate. Mappings to the "Citation Information" section of the Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM) ("FGDC-STD-001-1998") and the "DataCite Metadata Scheme Ver. 2.2 (July 2011)" are included. Each of the elements are described in more detail below.

Citation Element	FGDC CSDGM field	DataCite Metadata Scheme ID and Property
Author or Creator*	idinfo > citation > citeinfo > "origin"	2 Creator*
Release Date*	idinfo > citation > citeinfo > "pubdate" and sometimes "othercit"	5 PublicationYear*
Title*	idinfo > citation > citeinfo > "title" and possibly "edition"	3 Title*
Version*		15 Version
Archive and/or Distributor*	idinfo > citation > citeinfo > "publish"	4 Publisher*
Locator, Identifier, or Distribution Medium*	idinfo > citation > citeinfo > "othercit" or "onlink"	1 Identifier*
Access Date and Time*	not applicable	8 Date
Subset Used	not applicable	not applicable but may be captured in 8 Date or 11 AlternateIdentifier
Editor or Other Important Role	idinfo > citation > citeinfo > "origin"	7 Contributor
Publication Place	idinfo > citation > citeinfo > "pubplace"	not applicable
Distributor, Associate Archive, or other Institutional Role	idinfo > citation > citeinfo > "othercit"	7 Contributor?
Data Within a Larger Work	idinfo > citation > citeinfo > "othercit" or "lworkcit"	12 RelatedIdentifier?

*Mandatory (if applicable)

Mandatory Content

Author

This is name of the individual(s) or organization(s) whose intellectual work, such as a particular field experiment or algorithm, led to the creation of the data set. This is sometimes called the data creator. We prefer the term author because of its implied intellectual effort and its conventional use in citing traditional works. The archive, in close collaboration with data providers, needs to determine who deserves to receive credit and accept responsibility for the data set. Similarly, the archive needs to work closely with the provider to define the appropriate level of aggregation for the data set.

In some cases, the data set authors may have also published a paper describing the data in great detail. These sort of data papers should be encouraged, and both the paper and the data set should be cited when the data are used.

In addition to the data author, there may be "editors" or other roles that could be included in the citation. Other minor roles could be credited elsewhere in data documentation.

Doe, J. and R. Roe. 2001. The FOO Data Set. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

A particular group or organization may sometimes be the author.

The FOO Working Group. 2001. The FOO Data Set. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

If the data set is a collection of several smaller, independent data sets, the individual data sets would have their own specific citations with author, but the whole collection would not have an author. The collection would likely have an editor or compiler, though. See "Data Within a Larger Work"

Doe, J. (compiler) 2001. The FOO Collection. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

Release Date

For a completed data set, the release date is simply the year of release. A more precise date can be used if needed.

Doe, J. and R. Roe. 2001. The FOO Data Set. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

If detailed versioning information is lacking for a data set it may be appropriate to try and capture when updates occurred. For a data set that is updated infrequently or on an irregular basis, list the first year of released followed by "updated" with the current update information.

Doe, J. and R. Roe. 2001, updated 2005. The FOO Occasionally Updated Data Set. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

For an ongoing data set that is updated on a regular or continual basis, list the first year of release followed by the last update. Updates could occur annually or more frequently.

Doe, J. and R. Roe. 2001, updated daily. The FOO Time Series Data Set. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

A note on updates vs. new versions:

Ongoing updates to a time series do change the content of the data set, but they do not typically constitute a new version or edition of a data set. New versions typically reflect changes in sampling protocols, algorithms, quality control processes, etc. Both a new version and an update may be reflected in the release date. The version number should also be included. See also the "Note on Versioning and Locators."

Doe, J. and R. Roe. 2001, updated daily. The FOO Time Series Data Set. Version 3.2. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

Title

This is the formal title of the data set. It may also include version or edition information, but should be carefully controlled. A better alternative is to track version information independent of the title. Note this is the title of the data set, not the project or a related publication. It is important for the data set to have an identity and title of its own.

Doe, J. and R. Roe. 2001. The FOO Data Set. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

Archive or Distributor

This is the organization that maintains and manages the release or distribution of the data set. There is often an implied responsibility for stewardship of the data set. This role is often considered that of a data "publisher," but we avoid that term because it may imply proprietary restrictions or unintended assertions of quality or peer-review. These issues are beyond the intent of data citation. DataCite describes this role well: "The entity that holds, archives, publishes, prints, distributes, releases, issues, or produces the resource. This property will be used to formulate the citation, so consider the prominence of the role."

Doe, J. and R. Roe. 2001. The FOO Data Set. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

Version

Careful versioning and documentation of version changes are central to enabling accurate citation. Data stewards need to track and clearly indicate precise versions as part of the citation for any version greater than 1. It may be appropriate to track major and minor versions. See the note below on versions and locators.

Doe, J. and R. Roe. 2001. The FOO Data Set. Version 2.3. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

Locator or Distribution Medium

If there is one fixed medium, indicate it. For example, CD-ROM, DVD.

Doe, J. and R. Roe. 2001. The FOO Data Set. The FOO Data Center. CD-ROM.

More typically, data are available over the internet or through multiple digital media options. Then it is necessary to include a persistant reference to the location of the data. Often this is through a standard URL, but the mutability and lack of persistence of URLs is a known problem. The assignment of a unique and persistent locator offers a more consistent approach for managing current location information. Any reasonably persistent location service such as 'DOIs, 'ARKs, Handles, 'PURLs etc. is acceptable. Scientific publishers, however, are most familiar with the DOI. Furthermore, Thomson Reuters is exploring the inclusion of' DOIs and possibly 'ARKs in their citation indices, such as the Web of Science. This suggests that while any persistent locator can be used in a citation, 'DOIs and possibly 'ARKs are more likely to be accepted by publishers. Note 'DOIs, 'ARKs, and Handles are useful to locate full data sets or collections. Other locators and identifiers may be more appropriate for locating individual records or files. Best practice is that the suffix of the identifier does not include a reference to the archive in case the data are moved from the original location where the persistent identifier was assigned initially.

Doe, J. and R. Roe. 2001. The FOO Data Set. Version 2.3. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

See additional discussion in the "Subset Used" section as well as the "Note on Locators and Versions" and "Note on Locators versus Identifiers".

Access Date

Because data can be dynamic and changeable in ways that are not always reflected in release dates and versions, it is important to indicate when on-line data were accessed. This is in keeping with common citation practice for online documents and other resources.

Doe, J. and R. Roe. 2001. The FOO Data Set. Version 2.3. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

Note on Locators vs. Identifiers

Identity and location are often confused or conflated. While one can often use an item's location to identify it or an item's identity to locate it, the concepts are distinct. This is easily conceived when we consider a human example. A name such as "John James Doe" (Office Manager at the FOO Data Center) is an identifier. An address such as "123 Main St. #201, Peoria, IL, 12345-1234, USA" is a locator.

The locator might work as an identifier, because you might find John in his office, but he may also have retired and there is a new Office Manager who plays the same role but is not the same person. Similarly, you may be able to locate John based on his name and title, but what happens if he is telecommuting this week and is in Poughkeepsie not Peoria? It is similar with digital objects. One might be able to identify a data set by its URL, for example, but there is no guarantee that what is at that URL today is the same as what was there yesterday,

Confusingly, a Digital Object Identifier is a locator. It is a Handle based scheme whereby the steward of the digital object registers a location (typically a URL) for the object. There is no guarantee that the object at the registered location will remain unchanged. Consider a continually updated data time series, for example. Indeed, the advantage is to separate location information from other information about the resource. The location is not part of the resource description so that location can be managed independently, thereby enabling URL changes, migrations to other archives, etc.

While it is desirable to uniquely identify the cited object, it has proven extremely challenging to identify whether two data sets or data files are scientifically identical. Furthermore, Earth science data sets can be highly mutable. At this point, we must rely on location information combined with other information such as author, title, and version to uniquely identify data used in a study. See also the "Note on Versioning and Locators" and the section on "Subset Used."

Note on Versioning and Locators

The key to making registered locators, such as DOIs, ARKS, or Handles, work unambiguously to identify and locate data sets is through careful tracking and documentation of versions. Individual stewards and data centers will need to develop and follow their own practices, but here are some suggestions on how to handle different data set versions relative to an assigned locator. This is an initial suggestion based on work at the National Snow and Ice Data Center exploring many different kinds of data sets and production patterns.

Track major_version.minor_version.[archive_version]. Archive version is only used internally to track any change in the archive. Major and minor version are exposed publicly and reflect actual scientific change in the data.
Individual stewards need to determine which are major vs. minor versions and describe the nature and file/record range of every version. Typically, something that affects the whole data set like a reprocessing would be considered a major version.
Assign unique locators to major versions.
Old locators for retired versions should be maintained and point to some appropriate web site that explains what happened to the old data if they were not archived.
A new major version leads to the creation of a new collection-level metadata record that is distributed to appropriate registries. The older metadata record should remain with a pointer to the new version and with explanation of the status of the older version data.
Major and minor version should be listed in the recommended citation.
Minor versions should be explained in documentation, ideally in file-level metadata.
Ongoing additions to an existing time series need not constitute a new version. This is one reason for capturing the date accessed when citing the data.
Applying UUIDs, or other locators, to individual files upon ingest aids in tracking minor versions and historical citations.

@@ Line 31: / Line 31: @@
 * Archive and/or Distributor--the organization distributing or caring for the data, ideally over the long term.
 * Locator/Identifier--this could be a URL but ideally it should be a persistant service that resolves to the current location of the data in question. The Digital Object Identifier is currently the most broadly adopted service for persistently identifying and locating whole data collections (as opposed to individual files or granules), although other identifier/locator services, such as ARKs and Handles, could be used.
-* Time, date accessed--because data can be dynamic and changeable in ways that are not always reflected in release dates and versions, it is important to indicate when on-line data were accessed.
+* Access Date and Time--because data can be dynamic and changeable in ways that are not always reflected in release dates and versions, it is important to indicate when on-line data were accessed.
 Additional fields can be added as necessary to credit other people and institutions, etc. Additionally, it is important to provide a scheme for users to indicate the precise subset of data that were used. This could be the temporal and spatial range of the data, the types of files used, a specific query id, or other ways of describing how the data were subsetted.
 An example citation:
-<blockquote>Zwally, H.J., R. Schutz, C. Bentley, J. Bufton, T. Herring, J. Minster, J. Spinhirne, and R. Thomas. 2003. GLAS/ICESat L1A Global Altimetry Data V018, 15 October to 18 November 2003.  Data set accessed 2011-07-21 at doi:10.3334/NSIDC/gla01.</blockquote>
+<blockquote>Zwally, H.J., R. Schutz, C. Bentley, J. Bufton, T. Herring, J. Minster, J. Spinhirne, and R. Thomas. 2003. ''GLAS/ICESat L1A Global Altimetry Data V018, 15 October to 18 November 2003''. National Snow and Ice Data Center. Data set accessed 2011-07-21 at doi:10.3334/NSIDC/gla01.</blockquote>
 ==Detailed Citation Content==
@@ Line 49: / Line 49: @@
    | '''DataCite Metadata Scheme ID and Property'''
 |-
-   | Data Author or Creator*
+   | Author or Creator*
    | idinfo > citation > citeinfo > [http://www.fgdc.gov/csdgmgraphical/ideninfo/citat/citinfo/origin.htm "origin"]
    | 2 Creator*
 |-
-   | Release date*
+   | Release Date*
    | idinfo > citation > citeinfo > [http://www.fgdc.gov/csdgmgraphical/ideninfo/citat/citinfo/pubdate.htm "pubdate"] and sometimes [http://www.fgdc.gov/csdgmgraphical/ideninfo/citat/citinfo/othcit.htm "othercit"]
    | 5 PublicationYear*
@@ Line 65: / Line 65: @@
    | 15 Version
 |-
-   | Archive or Distributor*
+   | Archive and/or Distributor*
    | idinfo > citation > citeinfo > [http://www.fgdc.gov/csdgmgraphical/ideninfo/citat/citinfo/publish.htm "publish"]
    | 4 Publisher*
 |-
-   | Distribution medium or locator*
+   | Locator, Identifier, or Distribution Medium*
    | idinfo > citation > citeinfo > [http://www.fgdc.gov/csdgmgraphical/ideninfo/citat/citinfo/othcit.htm "othercit"] or [http://www.fgdc.gov/csdgmgraphical/ideninfo/citat/citinfo/online.htm "onlink"]
     | 1 Identifier*
 |-
-   | Access date*
+   | Access Date and Time*
    | not applicable
    | 8 Date
 |-
-   | Subset used
+   | Subset Used
    | not applicable
    | not applicable but may be captured in 8 Date or 11 AlternateIdentifier
 |-
-   | Editor or other important role
+   | Editor or Other Important Role
    | idinfo > citation > citeinfo > [http://www.fgdc.gov/csdgmgraphical/ideninfo/citat/citinfo/origin.htm "origin"]
    | 7 Contributor
 |-
-   | Publication place
+   | Publication Place
    | idinfo > citation > citeinfo > [http://www.fgdc.gov/csdgmgraphical/ideninfo/citat/citinfo/pplace.htm "pubplace"]
    | not applicable

Difference between revisions of "Interagency Data Stewardship/Citations/provider guidelines"

Revision as of 19:16, August 9, 2011

Contents

Introduction and Summary

Detailed Citation Content

Mandatory Content

Author

Release Date

Title

Archive or Distributor

Version

Locator or Distribution Medium

Access Date

Suggested Content as Needed

Subset Used

Editor, Compiler, or other important role

Archive or Distributor Place

Distributor, Associate Archive, or other Institutional Role

Data Within a Larger Work

Note on Locators vs. Identifiers

Note on Versioning and Locators