Talk:Interagency Data Stewardship/LifeCycle/Jul2009MeetingPlans

How Do I Tell if Two Files are Identical Enough to Have the Same Identifier? -- Bruce R. Barkstrom (Brb) 15:20, 7 April 2009 (EDT)

Here are several fairly serious questions for the group concerned with identifiers:

1. When are two files that contain the same data "identical"? Answers should take the form of algorithms that decide this question, and examples would be useful. For example, suppose we took an array and embedded it into a binary file, an HDF file, and an XML file. We probably wouldn't call these identical, but we'd probably want to have identifiers that could help say something like "these contain the same data, but they have different formats".

2. More difficult: suppose one of the files underwent a migrational transformation that converted ASCII characters to Unicode, 16-bit integers to 32-bit integers, and floats to double precision numbers. Should the transformed file have the same identifier - or does it need a new one? I'm inclined to think the identifier could be the same - but it could be a new format and therefore a variant of the original file.

3. How about transforming the order of array storage - FORTRAN to C (column-major to row-major)? In this case, I'm probably inclined to think the transformed file needs a new identifier as a variant format, although I think HDF could handle this.

4. What about translations (English to French, English to Japanese or Chinese) of annotations, like embedded parameter names?

5. What do we do about identifiers when some of the interpretation depends on information that the reading program itself reads in? A case of this sort might occur when the reading program could identify arrays, but relied on a separate input file for what the OAIS RM calls "semantic information".
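As one possible way to make questions 2 and 3 concrete, here is a minimal sketch (not a proposed standard) that decodes two differently stored copies of the same small integer array, puts the values into a common logical order, and hashes that canonical form. The function names (`unpack_ints`, `to_row_major`, `content_digest`) and the text-based canonical form are assumptions made for illustration only.

```python
import hashlib
import struct

def unpack_ints(raw, count, width, byteorder="<"):
    """Decode `count` signed integers of `width` bytes (2 or 4) from raw bytes."""
    code = {2: "h", 4: "i"}[width]
    return list(struct.unpack(f"{byteorder}{count}{code}", raw))

def to_row_major(flat, rows, cols, order):
    """Return elements in logical row-major order; `order` is 'C' or 'F'."""
    if order == "C":
        return list(flat)
    # Column-major (FORTRAN) storage: element (r, c) sits at index c * rows + r.
    return [flat[c * rows + r] for r in range(rows) for c in range(cols)]

def content_digest(values, rows, cols):
    """Hash the logical content: the shape plus the decimal text of each value.
    This canonical form is a hypothetical choice, not an agreed convention."""
    text = f"{rows}x{cols}:" + ",".join(str(v) for v in values)
    return hashlib.sha256(text.encode("ascii")).hexdigest()

# The same 2x3 array, stored two different ways.
raw_a = struct.pack("<6h", 1, 2, 3, 4, 5, 6)  # 16-bit, little-endian, row-major
raw_b = struct.pack(">6i", 1, 4, 2, 5, 3, 6)  # 32-bit, big-endian, column-major

digest_a = content_digest(to_row_major(unpack_ints(raw_a, 6, 2, "<"), 2, 3, "C"), 2, 3)
digest_b = content_digest(to_row_major(unpack_ints(raw_b, 6, 4, ">"), 2, 3, "F"), 2, 3)

print(raw_a == raw_b)        # False: the stored bytes differ
print(digest_a == digest_b)  # True: the logical content matches
```

Under this sketch the two files would share a content-level digest while still differing as byte streams, which is roughly the "same data, variant format" distinction raised above. Floating-point values and character encodings would need more careful canonicalization than the integer case shown here.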

Rob Raskin has also provided the following extension:

Unique identifiers should be *multifaceted*, i.e., multidimensional. The identifier could include facets such as: format, data type, and (if applicable) array order and byte order - appended to any subject classification of the data. Some tools may be unable to read data stored in particular ways. This is different from book classifications (Library of Congress or Dewey Decimal) that categorize only by subject and date (independent of whether the book is in paperback or hardcover, blue or green cover, etc.). I believe different languages imply different call numbers, so language should be another facet in the data identifier. We should be cognizant of the differences between books and data, in that the latter can be read by machines, whereas computers do not use the call number information to assist in reading books. The tricky part is to represent customized virtual datasets that are created upon demand (and can be created in multiple ways).
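To illustrate what a multifaceted identifier might look like in practice, here is a rough sketch as a small record type. The facet names follow the list above (format, data type, array order, byte order, language), but the class name `FacetedId`, the string layout, and the example subject value are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass(frozen=True)
class FacetedId:
    subject: str                       # subject classification of the data
    file_format: str                   # e.g. "HDF", "XML", "binary"
    data_type: str                     # e.g. "int16", "float64"
    array_order: Optional[str] = None  # "C" or "F", if applicable
    byte_order: Optional[str] = None   # "big" or "little", if applicable
    language: str = "en"               # language of embedded annotations

    def differing_facets(self, other):
        """Names of the facets on which two identifiers disagree."""
        return [f.name for f in fields(self)
                if getattr(self, f.name) != getattr(other, f.name)]

    def __str__(self):
        # Hypothetical flat string form: subject first, then name=value facets.
        parts = [self.subject]
        for f in fields(self)[1:]:
            value = getattr(self, f.name)
            if value is not None:
                parts.append(f"{f.name}={value}")
        return ";".join(parts)

# Two variants of the same (made-up) dataset.
hdf = FacetedId("sst-example", "HDF", "int16", array_order="C", byte_order="big")
xml = FacetedId("sst-example", "XML", "int16")

print(hdf.differing_facets(xml))  # ['file_format', 'array_order', 'byte_order']
print(xml)                        # sst-example;file_format=XML;data_type=int16;language=en
```

A tool that cannot read a particular storage layout could inspect the facets before attempting to open the file. This sketch does not address the harder case of virtual datasets created on demand.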

Ruth Duerr has also suggested the following reference:

M. Altman and G. King, "A Proposed Standard for the Scholarly Citation of Quantitative Data", D-Lib Magazine, March/April 2007, Vol. 13, No. 3/4, http://www.dlib.org/dlib/march07/altman/03altman.html

There are then two procedural questions:

1. Do we want to take time during the Federation meeting to deal with these issues?

2. Is this an issue to take up in the proposed metadata testbed?

If the answer to 2 is yes, then it might be interesting to propose several test files (in different formats) and ask for volunteers to build a "file identity test" program that could accept two files (or serializations of two files) and provide a response that might be (identical | not identical) or, in a more refined version, (data are identical in value, sequence, and parameter | not identical), along with separate verdicts on whether the representation information is identical and whether the annotations are identical. Judging from what I've been trying to write down, this is a bit more difficult than it looks on the surface.
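As a starting point for discussion, the refined response of a "file identity test" might be sketched as below. This assumes each file has already been parsed into a simple in-memory serialization; the `Serialization` type, its field names, and the shape of the report are hypothetical, and the hard part (the parsing itself) is left out.

```python
from dataclasses import dataclass

@dataclass
class Serialization:
    values: list          # data values in their logical sequence
    parameters: list      # parameter identifier for each value, same order
    representation: dict  # e.g. {"format": "HDF", "data_type": "int16"}
    annotations: dict     # e.g. embedded parameter names or descriptions

def file_identity_test(a, b):
    """Report identity separately for data, representation, and annotations."""
    report = {
        # Data must agree in value, sequence, and parameter.
        "data": a.values == b.values and a.parameters == b.parameters,
        "representation": a.representation == b.representation,
        "annotations": a.annotations == b.annotations,
    }
    # The coarse (identical | not identical) answer requires every layer to match.
    report["identical"] = all(report.values())
    return report

# Made-up example: a file and its migrated copy with wider integers.
original = Serialization([1, 2, 3], ["t", "t", "t"],
                         {"format": "binary", "data_type": "int16"},
                         {"t": "temperature"})
migrated = Serialization([1, 2, 3], ["t", "t", "t"],
                         {"format": "binary", "data_type": "int32"},
                         {"t": "temperature"})

print(file_identity_test(original, migrated))
# {'data': True, 'representation': False, 'annotations': True, 'identical': False}
```

Reporting the layers separately lets a caller decide whether a representation-only difference deserves a new identifier or just a variant marker.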

Note also that this issue has implications for records management in government archives, for legal definitions as to what constitutes the original data we want in a citation, and for Intellectual Property Rights management. As an additional note, there's Appendix E in the OAIS RM that has a "layered information model" that I think might provide some useful specificity on these questions - even though it's not a formal part of the standard.