Interagency Data Stewardship/Citations/provider guidelines

From Earth Science Information Partners (ESIP)

Introduction

Data citation is an evolving but increasingly important scientific practice. We see several important purposes of data citation:

  • To provide fair credit for data creators or authors, data stewards, and other critical people in the data production and curation process.
  • To ensure reasonable accountability for authors and stewards.
  • To aid in tracking the impact of data sets through reference in scientific literature.
  • To help data authors verify how their data are being used.
  • To aid scientific reproducibility through direct, unambiguous connection to the precise data used in a particular study.

Nevertheless, data are rarely cited formally in practice. There are myriad reasons for this discrepancy, but part of the problem is that there is a lack of consistent recommendations on how to actually construct a proper data citation. It is a responsibility of data centers and stewards to provide clear and consistent recommendations on how to cite data they manage. Current recommendations by ESIP members for citing their data range from casual acknowledgement within the text of a paper to formal and specific citations with unique and persistent digital identifiers. At the same time, various international organizations have been working to develop formal guidelines for data citation. Examples of these include:

The ESIP Preservation and Stewardship Cluster has examined these and other current approaches and finds that they are generally compatible and useful, but they do not entirely meet all the purposes of Earth science data citation. Indeed, we believe it may currently be impossible to fully satisfy the scientific reproducibility requirement in all situations. That said, we do believe a reasonably rigorous approach, coupled with good version tracking, comprehensive documentation, and due diligence on the part of data stewards, can provide a useful and precise citation for the great majority of Earth science data most of the time.

These citation guidelines help data stewards define and maintain precise, persistent citations for data they manage. The guidelines build from the IPY Guidelines and are compliant with the DataCite metadata kernel. The Cluster will continue to work with the CODATA Task Group, the GEOSS STC, and others to ensure our guidelines remain compatible with broader community practice. The approach defined here could be viewed as an extension or detailed profile of existing approaches.

In general, data sets should be cited like books. The examples here use the author-date system described in "Chicago Manual of Style, 15th Edition". When users cite data, they need to use the style dictated by their publishers, but by providing an example, data stewards can give users all the important elements they should include in their citations of data sets. Data stewards need to work closely with data providers and science teams to develop the actual content of the citation.

Citation Content

The citation should include the following elements as appropriate. Mappings to the "Citation Information" section of the Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM) ("FGDC-STD-001-1998") and the DataCite Metadata Scheme Ver. 2 (Jan. 2011) are included. Each element is described in more detail below.

Citation Element | FGDC CSDGM field | DataCite Metadata Scheme ID and Property
Data Author or Creator* | idinfo > citation > citeinfo > "origin" | 2 Creator*
Publication date* | idinfo > citation > citeinfo > "pubdate" and sometimes "othercit" | 5 PublicationYear*
Title* | idinfo > citation > citeinfo > "title" and possibly "edition" | 3 Title*
Version* | - |
Publisher* | idinfo > citation > citeinfo > "publish" | 4 Publisher*
Distribution medium or location* | idinfo > citation > citeinfo > "othercit" or "onlink" | 1 Identifier*
Access date* | not applicable | 8 Date
Editor or other important role | idinfo > citation > citeinfo > "origin" | 7 Contributor
Publication place | idinfo > citation > citeinfo > "pubplace" | not applicable
Distributor or associate publisher | idinfo > citation > citeinfo > "othercit" | 7 Contributor?
Subset used | not applicable | not applicable but may be captured in 8 Date or 11 AlternateIdentifier
Data within a larger work | idinfo > citation > citeinfo > "othercit" or "lworkcit" | 12 RelatedIdentifier?
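
For stewards who want to use the mapping above in software, it can be held in a small lookup structure. The following Python sketch is illustrative only: the names `ElementMapping`, `CITATION_ELEMENT_MAP`, and `fgdc_path` are hypothetical and not part of FGDC CSDGM or DataCite. Mappings that the table marks with a question mark are flagged as tentative. The Version element is left out because the table gives no mapping for it.

```python
from typing import NamedTuple, Optional, Tuple


class ElementMapping(NamedTuple):
    """One table row (hypothetical structure, for illustration only)."""
    fgdc_fields: Tuple[str, ...]  # field names under idinfo > citation > citeinfo
    datacite: Optional[str]       # DataCite property ID and name; None if not applicable
    tentative: bool = False       # True where the table marks the mapping with "?"


# Keys are hypothetical short names for the citation elements in the table.
CITATION_ELEMENT_MAP = {
    "author": ElementMapping(("origin",), "2 Creator"),
    "publication_date": ElementMapping(("pubdate", "othercit"), "5 PublicationYear"),
    "title": ElementMapping(("title", "edition"), "3 Title"),
    "publisher": ElementMapping(("publish",), "4 Publisher"),
    "distribution_location": ElementMapping(("othercit", "onlink"), "1 Identifier"),
    "access_date": ElementMapping((), "8 Date"),
    "editor": ElementMapping(("origin",), "7 Contributor"),
    "publication_place": ElementMapping(("pubplace",), None),
    "distributor": ElementMapping(("othercit",), "7 Contributor", tentative=True),
    # The table notes a subset may be captured in 8 Date or 11 AlternateIdentifier.
    "subset_used": ElementMapping((), None),
    "larger_work": ElementMapping(("othercit", "lworkcit"), "12 RelatedIdentifier", tentative=True),
}


def fgdc_path(field: str) -> str:
    """Spell out the full FGDC CSDGM path for a citeinfo field name."""
    return f'idinfo > citation > citeinfo > "{field}"'
```

A lookup such as `CITATION_ELEMENT_MAP["author"].datacite` then returns the DataCite property for an element, and `fgdc_path("origin")` returns the path as written in the table.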

Author

This is the individual(s) whose intellectual work, such as a particular field experiment or algorithm, led to the creation of the data set. This is sometimes called the data creator. We prefer the term author because of its implied intellectual effort.

Doe, J. and R. Roe. 2001. The FOO Data Set. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

A particular group or organization may sometimes be the author.

The FOO Working Group. 2001. The FOO Data Set. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

If the data set is a collection of several smaller, independent data sets, the individual data sets would have their own specific citations with author, but the whole collection would not have an author. The collection would likely have an editor or compiler, though.

Doe, J. (compiler) 2001. The FOO Collection. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.


Publication Date

For a completed data set, the publication date is simply the year of release. A more precise date can be used if needed.

Doe, J. and R. Roe. 2001. The FOO Data Set. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

If detailed versioning information is lacking for a data set, it may be appropriate to try to capture when updates occurred. For a data set that is updated infrequently or on an irregular basis, list the first year of publication followed by "updated" with the current update information.

Doe, J. and R. Roe. 2001, updated 2005. The FOO Occasionally Updated Data Set. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

For an ongoing data set that is updated on a regular or continual basis, list the first year of publication followed by the last update. Updates could occur annually or more frequently.

Doe, J. and R. Roe. 2001, updated daily. The FOO Time Series Data Set. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

A note on updates vs. new versions:

Ongoing updates to a time series do change the content of the data set, but they do not typically constitute a new version or edition of a data set. New versions typically reflect changes in sampling protocols, algorithms, quality control processes, etc. Both a new version and an update may be reflected in the publication date. The version number should also be included.

Doe, J. and R. Roe. 2001, updated daily. The FOO Time Series Data Set. Version 3.2. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.
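
The three date patterns shown in this section (a plain year, a year with a later update, and a year with an update frequency) can be generated by one small helper. This is a minimal sketch; the function name `format_publication_date` and its parameters are hypothetical and simply mirror the example citations above.

```python
from typing import Optional


def format_publication_date(first_year: int, update: Optional[str] = None) -> str:
    """Build the date portion of a data citation.

    first_year: year the data set was first published.
    update: optional update information, e.g. "2005" for an occasional
            update or "daily" for a continually updated time series.
    """
    if update is None:
        return str(first_year)
    return f"{first_year}, updated {update}"
```

For example, `format_publication_date(2001, "daily")` gives `2001, updated daily`. The version number is handled separately, since an update does not by itself make a new version.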

Title

This is the formal title of the data set. It may also include version or edition information, but it is better to track version independent of the title.

Doe, J. and R. Roe. 2001. The FOO Data Set. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

Version

Subset Used

This may be the most challenging aspect of data citation. It is necessary to enable "micro-citation", or the ability to refer to the specific data used: the exact files, granules, records, and so on. An example in a traditional context would be quoting a certain passage in a book, where one then references a specific page number in the citation. Alternatively, one might make reference to the "structural index" of a canonical text (e.g. book, chapter, and verse in the King James Bible). Unfortunately, data sets typically lack page numbers or canonical versions. Nevertheless, there is often a consistent structural form to how a data set is organized that can help users cite a specific subset. Data stewards should suggest how to reference subsets of their data. With Earth science data, subsets can often be identified by referring to a temporal and spatial range.

Doe, J. and R. Roe. 2001, updated daily. The FOO Gridded Time Series Data Set. Version 3.2. Oct. 2007–Sep. 2008, 84°N, 75°W; 44°N, 10°W. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

Sometimes, the data may be packaged in different sub-collections or "Archive Information Units," which can be referred to.

Doe, J. and R. Roe. 2001. The FOO Data Set. Version 2.0 shapefiles. The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.
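
A steward who recommends temporal and spatial subset descriptions could generate them consistently with a helper like the one below. This is a sketch under stated assumptions: the function names are hypothetical, the spatial range is assumed to be given as two corner points in signed decimal degrees (north and east positive), and the output layout follows the gridded time series example above.

```python
from typing import Tuple


def format_corner(lat: float, lon: float) -> str:
    """Format one corner point, e.g. (84, -75) -> '84°N, 75°W'."""
    ns = "N" if lat >= 0 else "S"
    ew = "E" if lon >= 0 else "W"
    return f"{abs(lat):g}°{ns}, {abs(lon):g}°{ew}"


def format_subset(start: str, end: str,
                  corner_a: Tuple[float, float],
                  corner_b: Tuple[float, float]) -> str:
    """Describe a subset by its temporal range and two bounding corners."""
    return f"{start}–{end}, {format_corner(*corner_a)}; {format_corner(*corner_b)}"
```

Calling `format_subset("Oct. 2007", "Sep. 2008", (84, -75), (44, -10))` gives a string that can be placed after the version in the citation.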

Editor, Compiler, or other important role

Occasionally, there are other people besides the authors who played an important role in the creation or development of a data set. Often these people can be characterized as editors or compilers, but other roles might also be identified. An editor is the person or team who is responsible for creating a value-added and possibly quality-controlled product from the data. In cases where there is minimal scientific or technical input, yet still substantial effort in compiling the product, the person may be more correctly cited as a compiler. Editors and compilers may often be responsible for a larger work that includes multiple data sets from different authors. Occasionally, there may be both a compiler and an editor.

Doe, J. 2001. The FOO Data Set. Version 2.0. R. Roe (ed.). The FOO Data Center. doi:10.xxxx/notfoo.547983. Accessed 1 May 2011.

When there is an editor or compiler but no author, the editor is listed first.

Publication Place

This is the city, state (when necessary), and country of the publisher.

Cavalieri, D., C. Parkinson, P. Gloersen, and H. J. Zwally. 1996, updated 2006. Sea ice concentrations from Nimbus-7 SMMR and DMSP SSM/I passive microwave data, March 2002–Sept. 2003. Boulder, Colorado USA: National Snow and Ice Data Center. Data set accessed 2008-05-14 at "http://nsidc.org/data/nsidc-0051.html".

Publisher

The publisher is whoever published the data set. A publisher often has an implied responsibility for stewardship of the data set. This is usually a data center and is written immediately after the place.

Cavalieri, D., C. Parkinson, P. Gloersen, and H. J. Zwally. 1996, updated 2006. Sea ice concentrations from Nimbus-7 SMMR and DMSP SSM/I passive microwave data, March 2002–Sept. 2003. Boulder, Colorado USA: National Snow and Ice Data Center. Data set accessed 2008-05-14 at "http://nsidc.org/data/nsidc-0051.html".

Distributor or Associate Publisher

This field should be used only when it differs from the publisher, i.e. rarely. Its listing should be written in the same manner as that of publisher. Sometimes NSIDC acts as a simple distributor; sometimes we are an associate publisher; sometimes others are associate publishers.

Environmental Working Group. 2000. Environmental Working Group: Joint U.S.-Russian Arctic sea ice atlas. Ann Arbor, MI: Environmental Research Institute of Michigan; distributed by the National Snow and Ice Data Center. CD-ROM.

Cross, M. compiler. 1997. Greenland summit ice cores. Boulder, CO: National Snow and Ice Data Center in association with the World Data Center A for Paleoclimatology at NOAA-NGDC, and the Institute of Arctic and Alpine Research. CD-ROM.

Distribution Medium and Location

If there is one fixed medium, list it. For example, CD-ROM, DVD.

International Permafrost Association Standing Committee on Data Information and Communication (comp.). 2003. Circumpolar Active-Layer Permafrost System, Version 2.0. Edited by M. Parsons and T. Zhang. Boulder, CO: National Snow and Ice Data Center/World Data Center for Glaciology. CD-ROM.

If data are available over the internet or through multiple digital media options it is best to include a reference to the location of the data. Often this is through a standard URL.

Cavalieri, D., C. Parkinson, P. Gloersen, and H. J. Zwally. 1996, updated 2006. Sea ice concentrations from Nimbus-7 SMMR and DMSP SSM/I passive microwave data, March 2002–Sept. 2003. Boulder, Colorado USA: National Snow and Ice Data Center. Data set accessed 2008-05-14 at "http://nsidc.org/data/nsidc-0051.html".

Ideally, a persistent identifier such as a Digital Object Identifier should be used.

König-Langlo, Gert and Hartwig Gernandt. 2006. Compilation of radiosonde data from the Antarctic Georg-Forster station of the German Democratic Republic from 1985 to 1992. Bremerhaven, Germany: Alfred Wegener Institute for Polar and Marine Research. Data set accessed 2008-05-22. doi:10.1594/PANGAEA.547983

Access Date

Because data can be dynamic and changeable in ways that are not always reflected in publication dates and versions, it is important to indicate when on-line data were accessed. It is not necessary to indicate an access date for a fixed medium like a DVD.

Cavalieri, D., C. Parkinson, P. Gloersen, and H. J. Zwally. 1996, updated 2006. Sea ice concentrations from Nimbus-7 SMMR and DMSP SSM/I passive microwave data, March 2002–Sept. 2003. Boulder, Colorado USA: National Snow and Ice Data Center. Data set accessed 2008-05-14 at "http://nsidc.org/data/nsidc-0051.html".
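
The access date rule (include it for on-line data, omit it for a fixed medium) can be built into whatever tool produces the recommended citation. The sketch below is one possible approach, not a prescribed implementation. The class `DataCitation` and its field names are hypothetical, and the element order follows the FOO examples earlier in this document rather than any particular publisher's style.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Explicit month names keep the output independent of the system locale.
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


@dataclass
class DataCitation:
    authors: str
    publication_date: str
    title: str
    publisher: str
    version: Optional[str] = None
    medium: Optional[str] = None    # fixed medium such as "CD-ROM"
    locator: Optional[str] = None   # DOI or URL for on-line data
    accessed: Optional[date] = None

    def render(self) -> str:
        parts = [self.authors, self.publication_date, self.title]
        if self.version:
            parts.append(f"Version {self.version}")
        parts.append(self.publisher)
        if self.medium:
            parts.append(self.medium)
        if self.locator:
            parts.append(self.locator)
            # The access date is only meaningful for on-line data.
            if self.accessed:
                d = self.accessed
                parts.append(f"Accessed {d.day} {MONTHS[d.month - 1]} {d.year}")
        # End every element with exactly one period.
        return " ".join(p.rstrip(".") + "." for p in parts)
```

With a `locator` and an `accessed` date set, `render()` ends with the identifier and an "Accessed" statement. With only a `medium` set, any access date is ignored.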

Data Within a Larger Work

A particular data set may be part of a compilation, in which case it is appropriate to cite the data set somewhat like a chapter in an edited volume.

Bockheim, J. 2003. "University of Wisconsin Antarctic Soils Database". In International Permafrost Association Standing Committee on Data Information and Communication (comp.). 2003. Circumpolar Active-Layer Permafrost System, Version 2.0. Edited by M. Parsons and T. Zhang. Boulder, CO: National Snow and Ice Data Center/World Data Center for Glaciology. CD-ROM.

Increasingly, publishers are allowing data supplements to be published along with peer-reviewed research papers. When using the data supplement one need only cite the parent reference. For example, when using the data at "doi:10.1594/PANGAEA.476007", the following reference is appropriate.

Stein, Ruediger, Bettina Boucsein, and Hanno Meyer. 2006. "Anoxia and high primary production in the Paleogene central Arctic Ocean: first detailed records from Lomonosov Ridge." Geophysical Research Letters, 33: L18606. doi:10.1029/2006GL026776.



Versioning and DOIs

Here are some suggestions on how to handle different data set versions relative to a DOI. This is an initial suggestion based on work at NSIDC exploring a variety of different kinds of data sets or production patterns. Revise as you see fit.

  • Use a version number of the form major version.minor version.[archive version].
  • Individual stewards need to determine which are major vs. minor versions and describe the nature and file/record range of every version.
  • Assign DOIs to major versions.
  • Old DOIs should be maintained and point to some appropriate page that explains what happened to the old data if they were not archived.
  • A new major version leads to the creation of a new collection-level metadata record that is distributed to appropriate registries. The older metadata record should remain with a pointer to the new version and with explanation of the status of the older version data.
  • Major versions (after the first version) should be captured in the data set title.
  • Minor versions should be explained in documentation, ideally in file-level metadata.
  • Minor versions should be exposed in the data set title and recommended citation.
  • Applying UUIDs to individual files upon ingest aids in tracking minor versions and historical citations.
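
The suggestions above can be sketched in code. The example below assumes version strings of the form "major.minor" or "major.minor.archive" with integer parts. The names `DataVersion`, `parse_version`, `needs_new_doi`, and `assign_file_uuid` are hypothetical. Deciding what counts as a major or minor change remains a judgment for the individual steward; the code only applies the rule that a new major version receives a new DOI.

```python
import uuid
from typing import NamedTuple, Optional


class DataVersion(NamedTuple):
    major: int
    minor: int
    archive: Optional[int] = None  # optional archive version


def parse_version(text: str) -> DataVersion:
    """Parse 'major.minor' or 'major.minor.archive' into its parts."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) not in (2, 3):
        raise ValueError(f"expected major.minor[.archive], got {text!r}")
    return DataVersion(*parts)


def needs_new_doi(old: str, new: str) -> bool:
    """A new DOI is assigned only when the major version increases."""
    return parse_version(new).major > parse_version(old).major


def assign_file_uuid() -> str:
    """Generate a random UUID to attach to a file at ingest."""
    return str(uuid.uuid4())
```

Under these assumptions, moving from version 3.1 to 3.2 keeps the existing DOI, while moving from 2.4 to 3.0 calls for a new one. The DOI for version 2 is then maintained and pointed at a page explaining the status of the older data.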