Talk:Interagency Data Stewardship/LifeCycle/Jul2009MeetingPlans

How Do I Tell if Two Files are Identical Enough to Have the Same Identifier? -- Bruce R. Barkstrom (Brb) 15:20, 7 April 2009 (EDT)

Here are several fairly serious questions for the group concerned with identifiers:

1. When are two files that contain the same data "identical"? Answers should take the form of algorithms that purport to decide this question; examples would be useful. For example, suppose we took an array and embedded it in a binary file, an HDF file, and an XML file. We probably wouldn't call these identical, but we'd probably want identifiers that could help say something like "these contain the same data, but they have different formats" (see the sketch after these questions).

2. More difficult: suppose one of the files underwent a migrational transformation that converted ASCII characters to Unicode, 16-bit integers to 32-bit integers, and floats to double-precision numbers. Should the transformed file keep the same identifier, or does it need a new one? I'm inclined to think the identifier could stay the same, although the result could be regarded as a new format and therefore a variant of the original file.

3. How about transforming the order of array storage, say from FORTRAN to C (column-major to row-major)? In this case, I'm inclined to think the transformed file needs a new identifier as a variant format, although I believe HDF could handle this.

4. What about translations (English to French, English to Japanese or Chinese) of annotations, like embedded parameter names?

5. What do we do about identifiers where part of the interpretation comes from information read into the reading program? A case of this sort might occur when the reading program can identify arrays but relies on a separate input file for what the OAIS RM calls "semantic information".
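
To make question 1 (and, to some extent, questions 2 and 3) concrete, here is a minimal sketch of a value-level comparison, assuming both files can be read into in-memory arrays with their correct logical shape (e.g., by an HDF or XML reader) and that numpy is available. The function names and the choice of canonical type are illustrative assumptions, not a proposed standard.

```python
import hashlib
import numpy as np

def canonical_digest(array, decimals=None):
    """Digest of the data values alone, ignoring the container format,
    the integer/float width, and row- vs. column-major storage order."""
    a = np.asarray(array)
    # Widen everything to one canonical numeric type so that int16 vs. int32
    # and float vs. double copies of the same values compare as equal.
    # (Caveat: very large 64-bit integers would lose precision here.)
    a = a.astype(np.float64)
    if decimals is not None:
        a = np.round(a, decimals)  # optionally tolerate representation noise
    # Hash the logical shape plus a C-contiguous copy of the elements, so the
    # digest is the same whether the source was written row- or column-major.
    h = hashlib.sha256(repr(a.shape).encode())
    h.update(np.ascontiguousarray(a).tobytes())
    return h.hexdigest()

def same_data(array_a, array_b, decimals=None):
    """True if the two arrays hold the same values in the same logical order."""
    return canonical_digest(array_a, decimals) == canonical_digest(array_b, decimals)
```

Note that this deliberately equates files whose representation information differs; a fuller test would report the data-level and representation-level results separately, as suggested in the procedural questions below.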

Rob Raskin has also provided the following extension:

Unique identifiers should be *multifaceted*, i.e., multidimensional. The identifier could include facets such as format, data type, and (if applicable) array order and byte order, appended to any subject classification of the data. Some tools may be unable to read data stored in particular ways. This is different from book classifications (Library of Congress or Dewey Decimal), which categorize only by subject and date, independent of whether the book is paperback or hardcover, has a blue or green cover, etc. I believe different languages imply different call numbers, so language should be another facet in the data identifier. We should be cognizant of the differences between books and data: the latter can be read by machines, whereas computers do not use call number information to assist in reading books. The tricky part is to represent customized virtual datasets that are created on demand (and can be created in multiple ways).
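
As a rough illustration of what such a multifaceted identifier might look like, here is a minimal sketch; the facet names and the colon-separated string form are assumptions for illustration, not an agreed convention.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FacetedIdentifier:
    subject_id: str        # subject classification of the data
    format: str            # e.g., "HDF5", "netCDF", "XML"
    data_type: str         # e.g., "int32", "float64"
    array_order: str = ""  # "row-major" / "column-major", if applicable
    byte_order: str = ""   # "little-endian" / "big-endian", if applicable
    language: str = ""     # language of annotations and parameter names

    def as_string(self) -> str:
        # Append the non-subject facets to the subject classification so a
        # tool can test just the facets it cares about and ignore the rest.
        facets = [v for v in (self.format, self.data_type, self.array_order,
                              self.byte_order, self.language) if v]
        return ":".join([self.subject_id] + facets)
```

A tool that cannot handle column-major data, for example, could decline a file whose array_order facet says "column-major" without ever opening it.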

Ruth Duerr has also suggested the following reference:

M. Altman and G. King, "A Proposed Standard for the Scholarly Citation of Quantitative Data," D-Lib Magazine, Vol. 13, No. 3/4, March/April 2007. http://www.dlib.org/dlib/march07/altman/03altman.html

There are then two procedural questions: 1. Do we want to take time during the Federation meeting to deal with these issues? 2. Is this an issue to take up in the proposed metadata testbed? If the answer to the second question is yes, then it might be interesting to propose several test files (in different formats) and ask for volunteers to build a "file identity test" program. Such a program would accept two files (or serializations of two files) and return a response that might be (identical | not identical) or, in a more refined version, (data are identical in value, sequence, and parameter | not identical), along with whether the representation information is identical and whether the annotations are identical. Judging from what I've been trying to write down, this is a bit more difficult than it looks on the surface.
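
Here is a minimal sketch of the graded verdict such a file identity test might return; the four levels below are one possible reading of the distinctions above (value/sequence/parameter identity, representation information, annotations), not a settled design.

```python
from enum import Enum

class IdentityVerdict(Enum):
    IDENTICAL = "identical"
    SAME_DATA_DIFFERENT_ANNOTATIONS = (
        "data and representation identical; annotations differ")
    SAME_DATA_DIFFERENT_REPRESENTATION = (
        "data identical in value, sequence, and parameter; representation differs")
    NOT_IDENTICAL = "not identical"

def classify(data_match: bool, representation_match: bool,
             annotations_match: bool) -> IdentityVerdict:
    """Combine the individual comparison results into one graded verdict."""
    if not data_match:
        return IdentityVerdict.NOT_IDENTICAL
    if not representation_match:
        return IdentityVerdict.SAME_DATA_DIFFERENT_REPRESENTATION
    if not annotations_match:
        return IdentityVerdict.SAME_DATA_DIFFERENT_ANNOTATIONS
    return IdentityVerdict.IDENTICAL
```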

Note also that this issue has implications for records management in government archives, for legal definitions of what constitutes the original data we want in a citation, and for Intellectual Property Rights management. As an additional note, Appendix E of the OAIS RM contains a "layered information model" that I think might provide some useful specificity on these questions, even though it is not a formal part of the standard.

Contributions from Steve Morris and Reagan Moore for Data Stewardship Workshop -- Bruce R. Barkstrom (Brb) 10:26, 8 May 2009 (EDT)

Steve Morris is Head of Digital Library Initiatives at North Carolina State University and has been working on a project involving the ingest of various kinds of geospatial data into a repository there. The project involves close collaboration with state and local agency geospatial data producers, with an eye to engaging the spatial data infrastructure community in the issue of data archiving. Most of this work involves GIS technologies, including spatial databases. The total data volume is much smaller than that of the NASA and NOAA repositories the ESIP Federation has discussed to date. In addition, the network capacity of some project partners is so limited that it has often been necessary to distribute data on external hard drives. This approach has the advantage of low expense and is quite acceptable because there is no demand for very low latency between the time the data are collected and the time they are used.

The production paradigms for this work are quite shallow - only one or two levels of production. Thus, it appears they have been able to work with a suggestion from the Library of Congress (LoC) that defines three levels of data products (which should NOT be confused with the three levels we've used in NASA and NOAA discussions). A particularly pressing issue for this community lies in the transience of some kinds of metadata, notably in the case of shape files, where the documentation may not reflect undocumented changes made to the data after the documentation was created. By analogy, shape file modifications appear to operate at a level of granularity equivalent to database transactions, so versioning would need to be continuous. Probably the only way to track this kind of transformation would be to maintain a database of transactions that could be rerun or audited to establish the actual history of the transformations.
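
A minimal sketch of that transaction-log idea, assuming each shape-file modification can be captured as a small record (the file name and field names here are invented for illustration):

```python
import json
import time

LOG_PATH = "shapefile_transactions.jsonl"  # hypothetical append-only log

def record_transaction(feature_id, operation, new_value, log_path=LOG_PATH):
    """Append one modification (e.g., a boundary edit) to the log."""
    entry = {"timestamp": time.time(), "feature_id": feature_id,
             "operation": operation, "new_value": new_value}
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")

def replay(up_to_timestamp, log_path=LOG_PATH):
    """Rebuild the state of every feature as of a given time, so any past
    version can be reconstructed or the edit history audited."""
    state = {}
    with open(log_path) as log:
        for line in log:
            entry = json.loads(line)
            if entry["timestamp"] <= up_to_timestamp:
                if entry["operation"] == "delete":
                    state.pop(entry["feature_id"], None)
                else:
                    state[entry["feature_id"]] = entry["new_value"]
    return state
```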

Steve has had some exposure to the legal ramifications of this problem, which is particularly acute for his community. First, members of this community may have difficulty obtaining the equivalent of an OAIS RM Submission Agreement. The personnel at state and county offices who would need to provide a Submission Agreement are overloaded. In addition, these folks get very nervous about bringing in lawyers, both out of a general desire to avoid complex legal entanglements and out of the high likelihood that this kind of interaction will delay getting on with business. Second, much of the data interchange is done informally and requires a diplomatic willingness to avoid the potential legal entanglements that might impede data flows. Since much of the data involves boundaries, there are financially interested parties on both sides of legal disputes over the data. This situation appears likely to lead to some very interesting and VERY costly requirements to overhaul the Intellectual Property Rights transfers that appear in discussions of provenance.

For the July meeting, Steve will provide a 15 to 20 minute discussion of his community's experiences and of that community's tools.

Reagan Moore is a chief scientist at the North Carolina Renaissance Computing Institute (RENCI), a professor in the School of Information and Library Science at UNC, and Director of the DICE Center at UNC. RENCI is a multi-institutional organization that brings together multidisciplinary experts and advanced technological capabilities to address pressing research issues and to find solutions to complex problems that affect the quality of life in North Carolina, the nation, and the world. Founded in 2004 as a major collaborative venture of Duke University, North Carolina State University, the University of North Carolina at Chapel Hill, and the state of North Carolina, RENCI is a statewide virtual organization.

Professor Moore is widely recognized as one of the "prophets" of long-term digital preservation. He has been one of the developers of the data grid technology known as the Storage Resource Broker (SRB), which organizes distributed data into shared collections. A number of institutions manage petabytes of data using the SRB data grid. In the last several years, Prof. Moore and his colleagues have developed the next generation of data grid technology, called the integrated Rule-Oriented Data System (iRODS) [see https://www.irods.org/index.php/Introduction_to_iRODS for an introduction]. Prof. Moore will give one of the 1-1/2 hour demos in the technology sessions on Monday.

He will also participate in the workshop on days 2 and 3 of the meeting. Prof. Moore and his colleagues have had substantial experience with distributed data management and will contribute a 15- to 20-minute summary of lessons they have learned about distributed data management, including experience with data identifiers and provenance.

One potentially interesting discussion for the workshop might be whether we need just one set of identifiers for each file or more than one. Prof. Moore has suggested that multiple identifiers might serve as a form of "Rosetta Stone," allowing translation between the facets of interest to different communities and thus improving the overall ability of data users to find what they need. He recently observed that every new community we encounter has a new dialect and identifier convention. With a community as broad as the Earth science data producer and user community, it is probably unrealistic to expect convergence on a single dialect, even with strong inclinations toward such convergence on the part of agency management.
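
As a sketch of the Rosetta Stone idea, the crosswalk below maps each community's local identifier for a dataset to the identifiers used by other communities; the dialect names and identifiers are invented for illustration, and a real registry would need persistence and governance behind it.

```python
from collections import defaultdict

class IdentifierCrosswalk:
    """Maps each community dialect's local identifier to a shared internal key,
    so an identifier in one dialect can be translated into another."""

    def __init__(self):
        self._by_key = defaultdict(dict)  # internal key -> {dialect: local id}
        self._lookup = {}                 # (dialect, local id) -> internal key

    def link(self, key, dialect, local_id):
        self._by_key[key][dialect] = local_id
        self._lookup[(dialect, local_id)] = key

    def translate(self, from_dialect, local_id, to_dialect):
        """Return the other community's identifier for the same dataset, if known."""
        key = self._lookup.get((from_dialect, local_id))
        return self._by_key[key].get(to_dialect) if key is not None else None

# Example: one dataset known under two (hypothetical) dialects
crosswalk = IdentifierCrosswalk()
crosswalk.link("shared-0001", "agency-A", "granule-42")
crosswalk.link("shared-0001", "county-GIS", "parcel-2009-07")
print(crosswalk.translate("agency-A", "granule-42", "county-GIS"))  # parcel-2009-07
```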

This consideration leads to the question: would it be less expensive to allow locally autonomous data centers to translate community protocols into their own dialects, or to impose a single dialect on all of the exchanges? A highly centralized approach is politically difficult because of "turf wars" over naming between the Earth science disciplines and data user communities, and it leads to lengthy discussions among high-level agency officials and senior scientists. It may be that a more federated approach would be faster and cheaper, since a global solution is not required.