Interagency Data Stewardship/2009AGUTownHall

Peer-Reviewed Data Publication and Other Strategies to Sustain Verifiable Science

December 17, 2009 at the AGU fall meeting in San Francisco

Bernard Minster started the town hall off with a review of the AGU's newly revised position statement on data which calls for:

Full and open sharing of data and metadata for research and education
Real-time access to data that are important for responding to natural hazards or that are needed to support environmental monitoring or climate models
Endorsement of the concept of data publication, "to be credited and cited like the products of any other scientific activity, and encouraging peer-review of such publications."

He focused on the need to change the Earth science communities' mindset on data collection: people who collect data do NOT get the recognition they deserve for that work, while users of the data are recognized because they publish papers. He concluded by noting that Earth and space science data are a world heritage and that it is our collective responsibility to preserve this resource.

Mark Parsons followed up by noting that a wide variety of organizations and projects support the concept of data citation (e.g., IPY, PANGAEA, NASA DAACs, USGS, NOAA National Data Centers), though not in any uniform or standard way. He then discussed the International Polar Year (IPY) citation guidelines, which are a synthesis of the different approaches agreed to by many international data centers. The IPY guidelines are analogous to the rules used in the publication process. Citations should include (as appropriate):

Authors (people directly responsible for acquiring the data)
Date (the date the data were published, not the date they were collected)
Title of the data set
Editor (the person who compiled the data set from other materials or performed QA on the acquired data, etc.)
Publisher (the organization, often a data center, that is responsible for archiving and distributing the material)
Version
Access date & URL
DOI, if one exists

He also noted variations for time series data, where data sets are dynamic and where perhaps only a subset of the data is used:

Algorithm developers are the authors
Date – add to the date published an indicator of how often the dataset is updated (e.g., updated daily)
Dates of data used
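
The citation elements listed above can be sketched as a small data structure. This is only an illustration: the class name `DataCitation`, its field names, the ordering and punctuation of the formatted string, and all sample values (authors, title, data center, URL) are assumptions made for this example and are not taken from the IPY guidelines themselves.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCitation:
    """Hypothetical container for the citation elements listed above."""
    authors: List[str]
    publication_year: int                    # year the data were published, not collected
    title: str
    publisher: str                           # often a data center
    access_date: str
    url: str
    editor: Optional[str] = None
    version: Optional[str] = None
    doi: Optional[str] = None
    update_frequency: Optional[str] = None   # time series variation, e.g. "updated daily"
    dates_used: Optional[str] = None         # time series variation: subset actually used

    def format(self) -> str:
        # The layout below is an assumed style, not one prescribed by the guidelines.
        date = str(self.publication_year)
        if self.update_frequency:
            date += f", {self.update_frequency}"
        title = self.title
        if self.dates_used:
            title += f" [{self.dates_used}]"
        parts = [f"{'; '.join(self.authors)} ({date})", title]
        if self.editor:
            parts.append(f"Edited by {self.editor}")
        if self.version:
            parts.append(f"Version {self.version}")
        parts.append(self.publisher)
        if self.doi:
            parts.append(f"doi:{self.doi}")
        parts.append(f"{self.url} (accessed {self.access_date})")
        return ". ".join(parts) + "."


# Made-up example of a dynamic time series data set of which only a subset was used.
citation = DataCitation(
    authors=["Researcher, A.", "Scientist, B."],
    publication_year=2009,
    title="Example Sea Ice Data Set",
    publisher="Example Data Center",
    access_date="2009-12-17",
    url="https://example.org/data/sea-ice",
    version="2",
    update_frequency="updated daily",
    dates_used="January 2008 to March 2008",
)
print(citation.format())
```

Optional elements (editor, version, DOI, update frequency, dates used) are simply omitted from the string when they are not supplied, which mirrors the "as appropriate" qualifier above.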

Finally, Ruth Duerr gave a brief overview of digital identifier technologies and their uses:

To uniquely & unambiguously identify a particular piece of data no matter which copy a user has (e.g. UUID)
To locate data no matter where they are currently held (e.g. handles, PURLs, OIDs)
To identify the data cited in a particular publication (DOI)
To be able to tell that two files contain the same data even if the formats are different. In other words, to determine if two files are “scientifically identical” to use Curt Tilmes' terminology.
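
Two of these uses can be illustrated with a minimal sketch. It assumes a toy data set of numeric rows stored once as CSV and once as JSON; the helper names `file_digest` and `content_digest` and the canonical form (values converted to floats and serialized as compact JSON) are choices made for this example only. Real "scientific identity" checks would also have to deal with precision, units, ordering, and metadata, so this should not be read as the definition used in the talk.

```python
import csv
import hashlib
import io
import json
import uuid


def file_digest(raw: bytes) -> str:
    """Hash of the exact bytes; changes whenever the file format changes."""
    return hashlib.sha256(raw).hexdigest()


def content_digest(rows) -> str:
    """Hash of the values in an assumed canonical form, independent of file format."""
    canonical = json.dumps(
        [[float(value) for value in row] for row in rows],
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same made-up measurements in two different formats.
csv_bytes = b"1.0,2.5\n3.0,4.25\n"
json_bytes = b"[[1, 2.5], [3, 4.25]]"

csv_rows = list(csv.reader(io.StringIO(csv_bytes.decode("utf-8"))))
json_rows = json.loads(json_bytes)

# A UUID assigned once would travel with every copy of the data set.
dataset_id = uuid.uuid4()
print("dataset identifier:", dataset_id)

# Byte-level hashes differ because the formats differ ...
print(file_digest(csv_bytes) == file_digest(json_bytes))      # False
# ... but the canonical content hashes agree.
print(content_digest(csv_rows) == content_digest(json_rows))  # True
```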

At that point three proposed discussion questions were put to the audience for consideration:

How should the intellectual effort of data publication be recognized?
Is peer review for data appropriate and how might it be implemented?
Is formal data citation appropriate and how might it work?

Recognition

In general, the consensus seemed to be that properly preparing data so that others can understand and re-use them is important, that it can take a lot of work and intellectual effort to collect and prepare data, and that that effort should be formally recognized as a contribution to the community and as an important consideration in tenure decisions. Currently a person who publishes really good data is given less credit than someone who publishes a bad paper.

While citation was noted as one possible route for this, the issue of separately recognizing multiple actors in the creation of a well-documented data set has yet to be resolved. For example, how do you provide credit for metadata creation? In some cases, there are mechanisms for handling this. For example, if a published data set is updated with quality information or other significant materials by another investigator, it would be appropriate to create and publish a new data set that includes the additional information. The designation of data set editors can also be useful. Authors may be those who conceived and designed the data collection, while editors may have provided QC processes, reformatting, and documentation for better data integration, etc.

Peer Review

Much of the resulting discussion surrounded the concept of peer-reviewed data, what peer-review of data means and how it might be implemented. One member of the audience noted that publishing non-peer-reviewed data is like presenting a paper at a conference and that a quality data set, one that could be considered to have been peer-reviewed, should have metadata, that the metadata should make sense, and that all the contextual information needed to truly understand and use the data should be available (e.g., calibration information). Another noted that, in some sense, a data citation is a form of peer review, the theory being that if the data are used and cited, then they must be useful. That argument was countered with an example where something was cited often as an example of what not to do! Another proposed that capturing community responses to published data is a form of peer-review, since in the past processes like these have often pointed out problems with the data. Another participant voiced the opinion that peer review is like auditing: it answers the question of whether scientists are following sound principles in producing their data.

In general, it was noted that data standards are lacking and that they would simplify peer-review. For example, standard formats would make it easy to determine whether a data set was properly formatted and ready for publication. Data validation, as something that should be independently verifiable, was also noted as an area that a peer-review process should consider. Lastly, it was pointed out that development of community best practices is key: once they are established, producers whose data don't meet them have failed the peer-review test. Data publishers, i.e. data centers as opposed to data providers or authors, have a central role in establishing appropriate peer review processes.

In general, it was noted that scale is an issue. Peer-review of millions of files is not possible, though the consensus seemed to be that projects producing that much data should have quality assurance processes, and that those processes should also be documented. This led to some discussion of the difference between quality assurance and peer-review, the conclusion of which seemed to be that quality assurance is done under the auspices of the producers of the data, while peer-review is done by independent groups or individuals.

Another issue noted was that data quality isn't completely a function of the data itself, but rather a function of the use to which the data are put. For example, it is often the case that data that are appropriate for one use are totally inappropriate for use in another context.

While no resolution to the question of peer-review resulted, one point became clear: the processes for, or even the need for, peer-review of data depend strongly on the size of the data set or the project producing it. At one extreme are the large data sets (e.g., MODIS) produced in industrial fashion by many of the agencies or large programs. In these cases, the algorithms used to process the data have typically been peer-reviewed and documented, for example through the Algorithm Theoretical Basis Documents required by NASA, and production quality assurance and validation processes are typically in place and well described. At the other extreme are research data collections produced by individual investigators or small projects, which rarely produce the level of documentation or undergo the levels of review of the large programs. Any peer-review process needs to take these differences into account.

Data Citation

The point made was that the ability to cite data and to actually access the data used in research is becoming ever more important because of its societal impact on the credibility of scientific results. "Climategate" was noted as an example of the type of influence that is increasingly making citation and associated data access more important. One complicating factor noted is that a researcher may only have used a subset of a large collection or may have used dozens or hundreds of smaller collections. Any data citation scheme needs to be able to support both cases easily. Others pointed out current problems with accessing cited data, as many active archives only keep the latest version of a data set and previous versions may not be available or reproducible. This led to a discussion of the notion that even if the data have been purged, the metadata about those data need to be kept. In that case, the citation would point to the metadata record for the purged data, and that record should explain what happened to the data. Others questioned the notion of citing a proxy to the data (e.g., data set documentation) rather than the data itself. Others noted that not only should the data be cite-able, but the algorithms used to produce them should also be cite-able, and that without both the reproducibility chain is incomplete. Lastly, the issue of publication was raised, with one audience member suggesting the creation of a Journal of Earth Science Datasets, the contents of which would simply be articles consisting of the metadata record for a dataset. One such journal already exists: Earth System Science Data.

Overall, there was consensus that data sets can and should be cite-able and that journals have a central responsibility to encourage data citation.