Collection Structure

From Earth Science Information Partners (ESIP)
 
Latest revision as of 16:50, February 20, 2014

ESIP Winter meeting 2014

Collection Structure Group Breakout Session

Meeting notes

Attendees: Mark Parsons, Donald Collins, Curt Tilmes, Sarah Ramdeen, Bruce Barkstrom (session lead), Ruth Duerr, Nic Weber

Remote Attendees: Nancy Hoebelheinrich

PDF: Revised_Collection_Structure_Breakout_Session.pdf

Background Review and Session Goals

Bruce noted that technical difficulties during the previous six months of telecons had made the group's work harder than it should have been. Hopefully, we will be able to reduce those difficulties in the future.

The goals for this session include:

  1. Clarifying the categories of stable objects in an archive's inventory
  2. Clarifying how an archive inventory would account for the objects
  3. Suggesting policies for inventory accounting for indistinguishable objects
  4. Suggesting an approach for identifying scientifically equivalent data in different files
  5. Suggesting policies for inventory accounting of scientifically equivalent data files

Bruce stressed that he is looking for feedback from the group.

Defining Objects Contained in Archives

Bruce started with a sequence of definitions that provide a basis for defining categories of objects in archives. The first definition used the notion that objects curated by an archive should be stable. Curt commented that there are some intermediate products for which archives may track provenance even though the objects are deleted immediately. As a result, these objects do not fit the definitions. Bruce agreed that transitory objects do present an issue. Curt said there are some objects they do track with DOIs and some they don’t. Bruce thought that was a good point, although it complicates the definition. Ruth confirmed that the definitions on the slide were Bruce's set of terms. Then she asked, "Are you going to want to relate them to definitions by other groups working in the same area?" Bruce said yes (particularly PCCS), but there had not been time for that in the telecons. For today, he suggested using these definitions for the breakout session's discussion, with revisions to be developed after the session.

Referring to the construction of the classification, Bruce's thinking was strongly influenced by Formal Concept Analysis (FCA), which provides an abstract, mathematical basis for taxonomies and classification. FCA starts with a binary matrix that identifies the attributes associated with a set of objects. The FCA algorithms produce a set of Formal Concepts, each of which pairs a set of objects with the set of attributes common to all of them. He stressed that there is no connection between a Formal Concept and a person's label for that grouping of objects and attributes. That might seem strange. However, this subject would be too difficult to explain in detail within the time available.
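To make the FCA idea concrete, here is a minimal sketch in Python. The object names and attributes are invented for illustration and are not taken from Bruce's slides. The brute-force enumeration is only workable for very small contexts, and practical FCA tools use more efficient algorithms.

```python
from itertools import combinations

# Hypothetical formal context: each object maps to the set of attributes it has.
# This is the binary object-by-attribute matrix written in set form.
CONTEXT = {
    "field notebook": {"physical", "document"},
    "rock core": {"physical"},
    "data file": {"digital", "data"},
    "readme file": {"digital", "document"},
}


def common_attributes(objects, context):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    shared = set().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return shared


def objects_having(attributes, context):
    """Objects that carry every attribute in `attributes`."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def formal_concepts(context):
    """Enumerate (extent, intent) pairs by closing every subset of objects."""
    concepts = set()
    names = sorted(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_attributes(subset, context)
            extent = objects_having(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


for extent, intent in sorted(formal_concepts(CONTEXT), key=lambda c: len(c[0])):
    print(sorted(extent), "<->", sorted(intent))
```

Note that the concepts come out as unlabeled pairs of object sets and attribute sets. Any name such as "physical objects" is something a person attaches afterwards, which is the point Bruce made about labels.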

Mark asked whether a database would fit into the categorization hierarchy from the classification slide. Bruce said that the major distinction between databases and Data Files or Documentation Files is that databases intrinsically include special content organizations and tools for manipulating and providing access to the database contents. These structures and access mechanisms are not part of digital files.

There was additional discussion of subsets and subcollections. Bruce related this discussion to the example in which an archive of rock cores could be regarded as a collection of wells. Each well would archive a collection of boxes that contain rock fragments. The fragments consist of rock samples made of minerals. Curt noted that any of these subcollections could be described as an array of objects but could also be called a subset. Bruce agreed that this would be an alternative description.

Moving to a Refinement of the OAIS RM Object Categorization

The OAIS model distinguishes between physical objects and digital objects. Bruce had created a more refined hierarchical classification of object types that had finer distinctions. He selected a few categories from the list and gave examples of the attributes that distinguish that category from other objects. One example was the category of physical documents. He drew attention to the differences between formal documents and informal ones. Observers who have formal training learn to use a standard format for recording their measurements. The author of an informal document does not have a standard form dictated by the organization for whom he or she is writing. The standard form is intended to reduce the probability of errors when the document records the measurements the observer made or when the observer makes calculations from the observations.

Another feature of the classification was the desire to base it on a taxonomic algorithm. In such an algorithm, the classifier identifies the attributes of the object that make its category different from the other object categories. A person who wants to independently verify the classification needs to be able to test whether an object has the attributes that define the category.

Figure 4-1 of the OAIS Reference Model notes that metadata for managing data goes into a database that is separate from the media storage that contains digital data and documentation. Mark said that the digital object categorization presented in this initial classification deals exclusively with data files and not with databases. Bruce agreed that the classification does not cover databases. The RM puts metadata into an archive's database, but it puts other digital objects into media storage. From a computational standpoint, databases incorporate specialized mechanisms for ingest, querying, and managing the database that are not included in digital files. Thus, these mechanisms are a distinctive attribute for distinguishing database objects from more ordinary digital files. Mark asked where model output would fit in this classification. Bruce answered that we might need to add model output to it.

Slides With Examples of the Classification Approach

Catalogers who want to categorize objects often use verbal definitions for this work. For example, section 3.2.1, p. 17 of the Functional Requirements for Bibliographic Records: Final Report [IFLA, 2009] defines a "work" as "a distinct intellectual or artistic creation." Bruce suggested that he would rather not spend time discussing whether an automated instrument was "thinking" when it recorded its measurements. Rather, it seems reasonable to identify such an instrument and a trained human being as "observers" rather than "authors" of the observational record. That way, we don't need to get caught up in a lengthy and quite abstract discussion of whether an automated instrument was an "author" of the record.

Bruce gave an example of measuring the depth of water under a ship during a category 5 hurricane 250 years ago. The observer would have to make the measurement in a howling wind and battering surf. This is not the time for the sailor to have an intellectual creation, i.e. to think. Rather, he is trained to record the depth of water from a knotted line thrown over the ship's side. Once he notes the depth, he is probably thinking only about getting back to the relative (perceived) safety of the ship's cabin. The observer is trained to make sure the recorded observations are accurate, even if that involves risk to life and limb. Bruce noted that Earth science tends to be observational. Thus, each observation is unique because the conditions of observation are uncontrolled and vary in both time and space. On the other hand, published papers in Earth sciences may contain original intellectual creations by human authors.

He also suggested that a data file from NCDC's Global Historical Climate Network (GHCN) could supply another example. Another object would be a digital file containing only data from one of the NASA EOS missions.

Don asked about objects - a file with data values and with representation information about how the values were encoded. Both the values and the encoding were bundled into one file. Bruce said that would be a digital object. It would be a digital data file, with representation information and data. The data values could be text or numerical values. The representation information identified in the OAIS RM could be encoded in several different ways without affecting the data values. Bruce asked if this separation was clearly understood. The attendees agreed that it was.

Collection Structure as Organized Subsets of Objects in an Archive's Collection of Data and Documentation

Bruce moved on to a production graph that shows the connection between data files and Production Generation Executables (PGEs). Each box in the figure is a file which contains data. Other shapes show the PGEs. At sufficiently high zoom (3200%), each of the shapes has a text label. Some of the files contain one day of data. Other files contain a monthly average of daily values. In addition, there are two versions of some of the daily files in the figure, as well as of the monthly files. Bruce pointed out how this diagram captures the production provenance for all the files in a month's worth of data. The collections are presented as horizontal groupings. Each group is a collection of files. He mentioned some other groups, but was convinced that the attendees understood the organization. Thus, the attendees could move on to the discussion questions related to classification concepts.

Discussion of the Categorization

Bruce asked about independent verification of object classifications, and how the attendees would improve the categorization.

Ruth said there are categories of data not yet covered here. As an illustration, she mentioned IEDA - Kerstin’s data center. Every single data set is its own independent data point with its own production graph. This is a different type of collection but all the other concepts seem to apply. Bruce thought it would be useful to understand how this example might be expanded within this context. He asked for other examples. Ruth mentioned streaming data, but that might be artificial because the producers do not cut it up.

Bruce mentioned staging data from disc to tape, where the disc copy appears. After data production uses the staged data, it disappears.

The intent of the classification is to use the objects as the atomic elements in an archive. In other words, collections are built from various combinations of these atomic elements.

Ruth suggested other cases, including collections of collections. In these examples, the relationship between the different collections is not derived from a production graph (or from production provenance relations). Rather the collections may be based on the curator's sense that the collections have a thematic relationship. For example, an art collection that has still life paintings could form a subcollection within a larger collection of paintings from the 17th century. The full collection would include other subjects, such as landscapes or portraits. The appropriate principles for such groupings are likely to represent a curator's sensibility with respect to the cultural expectations regarding categories. This example suggests that particular scientific disciplines might organize collections using different categorization principles and that these categorizations might evolve over time.

Bruce said that we can do a hierarchical categorization of atomic elements. For Ruth's example, a museum's collection could have paintings. The paintings in a particular room form a subcollection. Since the museum may have many rooms, the museum may have many subcollections. Bruce wanted to avoid a lengthy discussion until he knew that there was a common understanding.

Note Added in Editing the Minutes of the Breakout Session [BRB]

This notion of evolving collection organizations may be important for helping data users understand how to navigate through the collection structures to find objects that interest them. As an example, we might expect that a collection of biological specimens organized using a classical Linnean taxonomy would not be readily navigable by elementary students. On the other hand, such a collection organization would be readily navigated by classically trained biologists. Biologists trained in the application of DNA relationships (cladistics) might find another organization easier to navigate.

Another example of such a classification evolution is the movement from the Dewey Decimal classification to the one based on Library of Congress classification.

The important lesson is that the classification principles for taxonomies are not necessarily fixed by an initial notion of how objects and collections of objects are related. Rather, the collection may be reorganized based on new perceptions or principles.

The malleability of relationships managed by computers makes it much easier to invent new navigational search structures, and even to append them to previous ones. In the previous era of library classification, the physical items in a library collection could only have a single spatial organization. As Petroski points out in "The Book on the Bookshelf", over centuries the "natural" spatial order of book categories evolved because there were new insights into the categorization principles. Thus, in the 1600s Pepys could prefer a collection ordered by the size and color of the books in his personal library's book cases. The important criterion for Pepys was the aesthetic pleasure of the collection displayed in those book cases. Librarians in the 1900s used the Dewey decimal system to provide the order for the spatial location of books in their collections. Later, the cataloging principles of the Library of Congress became dominant. Librarians would have to rearrange the books on the shelves when the principles changed.

As cataloging principles developed, faceted classification became feasible – although it can substantially complicate the organization of collections. Even the typical facets of (subject, title, author) have made practical implementations of the Functional Requirements for Bibliographic Records (FRBR) difficult. Thus, the classical hierarchies of Dewey and the LoC have begun to move to multi-path graphs of relationships. From a mathematical standpoint, these become graphs with colored links where a user can navigate along a single color of link just as skiers might descend a hill using black diamond trails or blue ones.

In the future, some research directions suggest that paths through collections may move to using different metadata guides for different portions of the navigation.

End of Added Note

Ruth mentioned an individual cell in a database, although the object might also represent a row or column in a database table. It was not clear whether Ruth was thinking about metadata or data values maintained in a database. Bruce suggested that we avoid moving the discussion too deeply into the database world, although databases might be another category of object holding Earth science data. The basis for this preference is the OAIS distinction between metadata, which is separated from the data at ingest and stored in a database, and data-containing objects, which the RM stores in media. Curt mentioned Google Earth and how that also does something different. Donald said that Google Earth doesn’t have the same relation to the archival object. It assembles the archival objects for human consumption but does not treat the assembled objects as items that need to be added to the inventory.

There was some discussion of the differences between individual rock core samples and the notion that a rock core could be treated as a single object. The discussion suggested that there might be some exceptions to these definitions. Donald talked about the subcommunities within his organization. Each subcommunity might have varying definitions for an object. Thus, the collection organization should have the flexibility to reflect the views of each subcommunity. Using this approach would allow users to have different contexts and different community approaches to the same object. Bruce mentioned that the classification he was suggesting in his presentation was provisional at this time. Ruth stressed the possibility of additional exceptions.

Bookkeeping

Bruce talked about the accounting concepts (see slides). He thought that an object's state might include its location. He elaborated on the definitions in the slides, but didn’t want to belabor the topic.

He observed that it may be difficult to record a monetary valuation of objects in an archive. Thus, it may be more appropriate to simply count objects. In some cases, such as nails in boxes or sand in piles, it may be more sensible to record a more continuous measure of the size of a collection. Thus, a hardware store could record the number of boxes of 10# nails rather than the number of nails, or the weight of sand rather than the number of individual grains.

A key issue in accounting is maintaining a list of accounts, noting that such a list is usually called a "chart of accounts." The usual approach is to create a hierarchy. Some accounts might need a deeper hierarchy than other accounts, so the branches of the hierarchy may have different depths.

Bruce talked about examples of the accounting systems for library book inventories, where books may be checked out or in. The library has a journal that records a book checkout as a transaction where the library notes that the book was checked out to a particular patron. When the patron checks the book back in, the system creates another transaction that notes the patron has returned the book to the library.
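The journal idea can be sketched as an append-only list of transactions from which the current state of the inventory is derived by replay. This is only an illustration under assumed names (`Transaction`, `books_on_loan`, and the sample book and patron identifiers are hypothetical), not a description of any real library system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One journal entry; `action` is either "checkout" or "checkin"."""
    book_id: str
    patron: str
    action: str


def books_on_loan(journal):
    """Replay the journal in order and return {book_id: patron} for books still out."""
    on_loan = {}
    for entry in journal:
        if entry.action == "checkout":
            on_loan[entry.book_id] = entry.patron
        elif entry.action == "checkin":
            on_loan.pop(entry.book_id, None)
        else:
            raise ValueError(f"unknown action: {entry.action}")
    return on_loan


journal = [
    Transaction("QE471", "patron-17", "checkout"),
    Transaction("QC981", "patron-03", "checkout"),
    Transaction("QE471", "patron-17", "checkin"),
]
print(books_on_loan(journal))
```

Keeping the transactions rather than only the current state is what lets an auditor reconstruct where an object was at any earlier time.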

He also mentioned that the U.S. Treasury Department has a Standard General Ledger that provides the top level of the hierarchy for a chart of accounts. When agency program and project managers create a more detailed list of accounts, they append identifiers to the label of an account that include strings of digits for agencies and then projects. This OID-like structure is characteristic of governmental charts of accounts, as well as routine commercial practice. When Congress wants to promote good accounting practice, it may direct agencies to produce a common chart of accounts that merges the charts of individual groups. If the suborganizations don't like it, they may simply concatenate the identifiers and then pull out the original numbering scheme from the concatenation.
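As a rough illustration of the concatenation trick, the sketch below prefixes a suborganization's own account number with agency and project codes and then recovers the original number. The dotted format, the function names, and every code value are assumptions made for this example. They are not taken from the Standard General Ledger.

```python
def merged_account_id(agency_code, project_code, local_account):
    """Build a merged identifier by prefixing agency and project codes."""
    return ".".join([agency_code, project_code, local_account])


def original_account_id(merged_id):
    """Recover the suborganization's own account number from a merged identifier.

    Only the first two dots are treated as separators, so a local account
    number that itself contains dots survives intact.
    """
    _agency, _project, local_account = merged_id.split(".", 2)
    return local_account


merged = merged_account_id("20", "0417", "1010")
print(merged, "->", original_account_id(merged))
```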

Discussion of the Multiplicity of Approaches to Inventory Accounting and the Organization of Charts of Accounts

The discussion of accounting next moved to the question of how to deal with inventory accounts when some of the objects in an archive are exact replicas of each other. Bruce asked if the attendees want replicas placed in a single account or separated so that each object is kept in its own account.

Ruth said there is a more difficult question to be asked. She thought there are several accounting implementations that use different approaches. The Data Conservancy has apparently published a report dealing with this issue. Bruce asked for a reference to the report. She said that replicas were not treated as the same object. The Data Conservancy does use FRBR. In this approach, the object might be a spreadsheet containing scientific data. They archive the object and would note if there were another object with the same data and a different representation. Donald said it would be in the same account. They do something similar in his organization. He discussed the process and how the process is captured.

Bruce said this is a good thing to note. And when we get into this, we probably would be wise to talk about how we would organize a chart of accounts.

Ruth said that in the systems she is familiar with, the hierarchies can get pretty deep, and every one could be different: a collection of collections, a collection with data items and subcollections, and so on. They are all realistic. So you can characterize objects as containers or instances, but beyond that you can’t find a lot of commonality from one organization to another. As a result, it is difficult to characterize the services different archives provide.

She gave an example of a polar data center which organizes their data by projects with data sets with metadata and sub data sets with sub data sets with items etc.

Bruce said different communities have different languages and dialects based on the experiences of the community members. In some sense, trying to develop guidance for practical implementations is like field ethnography. We have to explore and come back with a report of what we have encountered.

Don had a question for Ruth: in the situation described, what do you, or what does the group, think is the archival object? Is it the archival package, as opposed to the information defining the packages as part of a collection, for a project's worth of stuff? Ruth said projects have collections of archivable objects. The content varies from project to project. Data files are probably data items and may or may not have metadata attached to them. But collections contain data items and also have additional information which is not part of the metadata about those items, and which is also archivable and needs to be archived. These would usually be called collection information packages.

Bruce noted that one view of the collection structure is embedded in the inventory's chart of accounts. However, if the archive's curators develop other views, these can lead to faceted searches. These may create separate classifications, in which each facet would have a different set of subsets of the total set of objects in the archive's collection. Donald said, again we have similar packages - discrete or continuous. The archive will try to manage the connections external to the object itself, using the metadata. Thus, an archival package can be part of more than one grouping of stuff. In this case, the collections might also evolve through folksonomies, where the classifications develop from tags applied through social media. Ruth mentioned that a lot of these classifications are provincial.

Bruce said there are multiple graphs to describe the collection structures. That multiplicity complicates the description of collection structure.

Sarah mentioned varying viewpoints, which might be different for Designated Community members from the original Data Producer community and for members outside of that community. Ruth added that you can enter a collection from the project viewpoint or from the object state in her example. Donald mentioned that descriptive metadata lets a user narrow a search down from varying levels, so it doesn’t matter if the objects came in from different projects. The objects might share some metadata attributes. Curt suggested the metadata attributes are part of the collection classification. Faceted search across attributes might require archives to invent new collections arbitrarily.

Bruce mentioned that facets can lead to some interesting mathematics. A mathematician could represent the structure of objects in an archive as a mathematical graph where facets provide links of different colors. In other words, the links that represent the relationships between objects have attributes. A faceted search is equivalent to following links with the same color in traversing the graph. Having more than one path through the graph makes this kind of structural or navigational search a rather interesting problem.
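A small sketch may help make the colored-graph picture concrete. It assumes the archive's relationships are stored as (source, facet, target) triples, and the node and facet names are invented for illustration. A faceted search then becomes a traversal that follows only links of one color.

```python
from collections import deque

# Hypothetical archive graph: the middle element of each triple is the link "color".
LINKS = [
    ("archive", "project", "project A"),
    ("archive", "parameter", "ocean color"),
    ("project A", "project", "file 1"),
    ("project A", "project", "file 2"),
    ("ocean color", "parameter", "file 2"),
    ("ocean color", "parameter", "file 3"),
]


def reachable_by_facet(start, facet, links):
    """Breadth-first traversal that follows only links of the given facet."""
    neighbors = {}
    for source, color, target in links:
        if color == facet:
            neighbors.setdefault(source, []).append(target)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


print(sorted(reachable_by_facet("archive", "project", LINKS)))
print(sorted(reachable_by_facet("archive", "parameter", LINKS)))
```

In this toy graph "file 2" is reachable along both colors, which is the multi-path situation that makes navigational search harder to formalize.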

Ruth raised a complication from the Data Conservancy. That group does not just serve Earth science data. Rather, it handles Earth science data, space science data, and social science data. Having one metadata standard is pointless. The metadata should make sense for each group. An archive can index its collection based on what is relevant to the various communities it serves. When a user does a query, his or her (or its) decision on which facets to use depends on the user's knowledge about which types of data are of interest.

Bruce said he would express that differently. Choosing a facet requires a user to choose a path through the collection's structure.

Ruth said if you are doing an open search for Earth science, the objects obtained from a query would be a subset but only the ones relevant to that domain. Bruce said he was keeping it simple. But she is right. There is a formalization problem.

Replication

The slides visualize a situation where an archive creates both online and offsite backups that most people would assume are identical -- and usually are.

Donald and Ruth had a discussion about two different copies and how you can develop trust. Bruce said he wanted to put off authenticity until next time we get together. He was just visualizing a situation where there were multiple copies.

Question: How do you want the archive to handle replicas? If an archive has only one copy, that could be a single point of failure for preservation. If there's a hardware failure or a successful security incident, the scientific information could be lost permanently. There are other information loss mechanisms as well. For example, an undetected router failure caused corruption of a large number of files transferred from one location to another. Donald said that is the case for running a checksum on both ends, before and after sending, to determine whether a problem arose in the transfer rather than somewhere else.

Curt noted, though, that doing so increases the work and the number of files to read. When the number of files is large, that resource use is not trivial.
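The checksum comparison Donald described can be sketched with Python's standard `hashlib` module. The choice of SHA-256, the chunk size, and the function names are assumptions made for this example. The minutes do not record which checksum algorithm any of the archives use.

```python
import hashlib


def file_checksum(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source_path, destination_path):
    """Compare checksums computed at the sending and receiving ends."""
    return file_checksum(source_path) == file_checksum(destination_path)
```

Each call reads its file in full, which is the cost Curt pointed out. In practice the sender's checksum is usually computed once and shipped alongside the file, so that only the receiving end has to read the data again.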

Scientific Equivalence

Bruce extended the replica discussion by asking when do we have scientific equivalence between different digital objects? Would it be sufficient to know that a knowledgeable person could reproduce numerical values even if the representation of the scientific data was different?

Sarah asked if this was like defining them along the lines of FRBR. Ruth said yes, two objects with different representations but equivalent data would be like having two manifestations in the FRBR nomenclature. [BRB is comfortable with this extended nomenclature, although he would rather not extend up the FRBR chain to include "expression" or "work" in the discussion. As noted above, it doesn't seem appropriate to have to invoke mental processes when instruments make automated records of measurements. Perhaps a more precise nomenclature would be that digital objects with equivalent scientific data values are "Formal Manifestations" of the equivalence class.]

Interpreting Bit Arrays in Digital Data Files

Bruce moved on to discuss the interpretation of bits in a digital file. The OAIS RM defines a Digital Object to be an array of bits. Two items are needed to interpret the array: (1) a sequence of integers that defines the length of each subarray of bits, and (2) an interpretation of each subarray. The slides call the subarray a "token". Thus, the bit array is broken into an array of tokens. At the next level up, the interpretation assembles arrays of tokens. A simple example of this two-level interpretation is that an ASCII text file makes each token from eight bits. Each of these bytes has a standard interpretation as an ASCII character. Thus the array of bits in the ASCII file is equivalent to an array of characters. The next level of interpretation leads to words, word phrases, or even sentences. Each of these object types is an array of characters. Of course, some files contain bits that represent integers or floats or double precision numbers.
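The two-level interpretation can be illustrated with a short sketch. It represents the bit array as a string of "0" and "1" characters purely for readability, and the function names are hypothetical.

```python
def tokenize(bits, lengths):
    """Split a bit string into tokens whose lengths are given by `lengths`."""
    if sum(lengths) != len(bits):
        raise ValueError("token lengths must cover the bit array exactly")
    tokens, position = [], 0
    for length in lengths:
        tokens.append(bits[position:position + length])
        position += length
    return tokens


def interpret_ascii(tokens):
    """Interpret each 8-bit token as one ASCII character."""
    return "".join(chr(int(token, 2)) for token in tokens)


bits = "0100111101001011"           # two bytes: 0x4F and 0x4B
tokens = tokenize(bits, [8, 8])     # item (1): the sequence of token lengths
print(interpret_ascii(tokens))      # item (2): the interpretation of each token
```

The same sixteen bits would mean something different under another pair of choices, for example a single token of length 16 interpreted as an unsigned integer.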

It is important to note that the interpretation of the tokens in a string depends on both the order of the characters in the array and on whether each character is upper case or lower case (at least in English). However, two strings of numerical values may be scientifically equivalent even if the order is permuted. A two-dimensional FORTRAN array can be scientifically equivalent to a two-dimensional C array even though these two languages arrange the sequence of numerical values in different orders when a computer program writes them to a disk file. A spreadsheet that reads in a set of ASCII-formatted numerical values and automatically converts these values to double precision values for internal use provides another example of scientifically equivalent numerical values with different representations.
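The FORTRAN and C case can be sketched in plain Python. C stores a two-dimensional array row by row (row-major order), while FORTRAN stores it column by column (column-major order). The helper names below are invented for this illustration.

```python
def flatten_row_major(grid):
    """Sequence of values as a C program would write a 2-D array."""
    return [value for row in grid for value in row]


def flatten_column_major(grid):
    """Sequence of values as a FORTRAN program would write the same array."""
    rows, cols = len(grid), len(grid[0])
    return [grid[r][c] for c in range(cols) for r in range(rows)]


def unflatten_row_major(values, rows, cols):
    """Rebuild the 2-D array from a row-major sequence."""
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


def unflatten_column_major(values, rows, cols):
    """Rebuild the 2-D array from a column-major sequence."""
    return [[values[c * rows + r] for c in range(cols)] for r in range(rows)]


grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
c_order = flatten_row_major(grid)
fortran_order = flatten_column_major(grid)
print(c_order)
print(fortran_order)
```

The two sequences differ, yet a reader who knows the storage order and the array shape recovers the same two-dimensional array from either one. That knowledge is part of the Representation Information.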

Encoding Representation Information

The RM describes Representation Information as "the information that maps a Data Object into more meaningful concepts" [p. 1-14]. There are at least three ways for a Digital Data Object to encode the Representation Information:

  1. Embed the encoding in source code that performs a function equivalent to a FORTRAN FORMAT statement.
  2. Embed the encoding in the created file, perhaps as XML tags or as the internal directory structures used in an HDF file.
  3. Put the encoding in an external file that contains the formatting instructions. The PGE might read in the entire file or stream in the instructions as it creates the output file.

Bruce suggested that the interpretation should not depend on how the digital object receives the Representation Information or how the interpretation of the object is done. Bruce doesn't feel that an extended philosophical debate over whether a Digital Data Object created with one method has different data from another Object that encodes the scientific data with a different method is fruitful. His perception is that it would be more useful to discuss algorithms that could independently verify that two digital objects that contain scientific data have data values that are scientifically equivalent in the sense that a knowledgeable user could use either representation to derive the same results.

Examples to Test Algorithms for Determining Whether Two Sets of Scientific Data Belong to an Equivalence Class

As an example, Bruce had created a list of different text conventions for the term 'ocean color'. The slides show different spellings and different conventions of upper and lower case for this keyword phrase. From a human standpoint they might be the same. On the other hand, computers are likely to treat them as distinct unless aliases or similar mechanisms are added.
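
A small sketch of how a program might fold such variants together. The variant spellings and the alias table below are hypothetical examples, not the list shown on the slides, and a real alias table would need the kind of governance discussed later in these notes.

```python
import re

# Hypothetical alias table; a real one would come from a governed vocabulary.
ALIASES = {"colour": "color"}


def normalize_keyword(phrase):
    """Reduce a keyword phrase to a canonical lower-case form."""
    words = re.split(r"[\s_\-]+", phrase.strip().lower())
    return " ".join(ALIASES.get(w, w) for w in words if w)


# Hypothetical variants of the same keyword phrase.
variants = ["Ocean Color", "ocean colour", "OCEAN_COLOR", "Ocean-Color"]

distinct_as_written = set(variants)                      # four distinct strings
canonical = {normalize_keyword(v) for v in variants}     # one canonical form
print(canonical)
```

Without the normalization step, a plain string comparison sees four different keywords; with it, all four collapse to a single entry.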

The next example came from a spreadsheet Bruce sent out before the meeting. In that spreadsheet, parameter keywords appeared in different collections, such as the Essential Climate Variable lists prepared by GEOSS, GCMD parameter keywords, and keywords from Unidata's Climate and Forecasting metadata schema. It was clear that each community had its own favored list of terms, which suggests that the disciplinary communities do not engage in extensive normalization of their dialects.

Discussion of Dialect Heterogeneity

Discussion questions: (1) what should we recommend about semantic heterogeneity? (2) do we have a best practice suggestion to handle the divergence between humans and machines?

Curt pointed out that we are talking about archivists who are accustomed to dealing with controlled vocabularies. Ruth said these were good when users could pick keywords from a list. Curt suggested this was useful when the list had proper governance.

Bruce asked about providing hints and alternatives. Curt said this is its own problem and can be saved for a different time; it is out of scope for collection structure.

In the FRBR, there is a hierarchy of abstractions. An "item" may have different "manifestations", with each "manifestation" belonging to an "expression", and the "expression" is a subset of a "work". It seems that the verbal definition has difficulty helping a user without library cataloging experience to understand what the cataloger means by these words.

An Example Dealing With Scientific Equivalence of Different Representations

Bruce next explained an example which showed two bitmap files. Each file had a header that contained a palette associated with eighteen IGBP categories of vegetation type. The data is contained in an array that follows the header, where each byte contains a binary number from 1 to 18. The pixel values in the array represented equal areas on the Earth's surface -- 1,024 bins in longitude and 512 bins in sin(colatitude). The area represented by a pixel was about 982 km². In one representation, the left edge of the map was at -18 degrees, allowing the Sahara Desert to appear as a contiguous area and placing the Pacific Ocean in the center of the diagram. Also, the North Pole was at the top of the map in its usual location. The ocean IGBP type had a palette that made the oceans appear in blue. The other bitmap file put the left edge of the map at -180 degrees of longitude, similar to many standard maps. However, the South Pole now appeared at the top of the map and the palette made the oceans appear to be red. These bitmaps have data arrays with apparently equivalent scientific data, although the order of the numerical values has been permuted.

The question was whether a user who was familiar with bitmaps, and who could find or create software to use their data values, would obtain from either file the same number of pixels whose byte value was '3', the encoding of the IGBP type for 'broadleaf deciduous forest'.

Curt said these might be two different manifestations of the same work. Bruce showed they were still the same mathematically in spite of the shockingly different appearances. He discussed developing an algorithm to show this as well. One issue is what we do with this equivalence when the archive repackages the bitmap with a different palette. Should you keep them separate? Ruth said this is why you need provenance. Curt said it was two different manifestations of the same work according to FRBR. ASCII and HDF have a similar issue.
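
One way such an equivalence check might look is sketched below. It is not the algorithm Bruce presented; it assumes a tiny hypothetical grid in place of the 512 x 1,024 maps, ignores the palette in the header, and treats a shifted left edge and a swapped pole as a column roll and a row flip.

```python
from collections import Counter


def roll_columns(grid, shift):
    """Move the left edge of the map by `shift` columns (longitude wraps)."""
    return [row[shift:] + row[:shift] for row in grid]


def flip_rows(grid):
    """Put the opposite pole at the top of the map."""
    return grid[::-1]


def category_counts(grid):
    """Count pixels per category; with equal-area pixels this is area per type."""
    return Counter(value for row in grid for value in row)


# Hypothetical 3 x 4 map of category codes.
map_a = [[17, 3, 3, 17],
         [17, 16, 3, 17],
         [15, 15, 15, 15]]
map_b = flip_rows(roll_columns(map_a, 1))

same_layout = map_a == map_b                                    # False
same_counts = category_counts(map_a) == category_counts(map_b)  # True
pixels_of_type_3 = category_counts(map_b)[3]
print(same_layout, same_counts, pixels_of_type_3)
```

Equal category counts are a necessary condition rather than a proof of equivalence. A stricter test would undo the known roll and flip and compare the arrays cell by cell.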

Bruce asked about the policy issues this raises for archives as far as users are concerned. The slides contain some thoughts on these issues.

Curt said you should uniquely identify each one and capture the provenance relationship between them. But Bruce said that doesn't capture authenticity. Donald said his inclination is to use appraisal and go back to the data provider and ask for clarification on the item. Bruce gave an example of a discussion of spatial grids in EOSDIS that took two years and ended in deadlock. Curt referred back to the bitmaps and asked if they were two different types. Donald said that one bitmap would be a "translation" of the other, but they would be scientifically equivalent. One of the two manifestations might be more archivable over the long term than the other, though both are valid.

Curt asked whether an author who used one in a publication should cite the archival one or the one they actually used. Donald said there should be one citation which would cover both representations. Curt said he would cite the one that he got. Bruce talked about hurdles and problems with inauthentic data sets based on obsolescence. Ruth said the existing citation guidelines cover these cases. If they went with the manifestations path, the author could use one. If he or she had both, either would work just fine. Bruce said it sounded like different dialects.

Archives with digital files that are scientifically equivalent - see slides for discussion points. Another example used monthly average precipitation data from rain gauge stations.

Curt mentioned the need for both algorithms when doing the test. He mentioned a suggested approach that used cryptographic digests. Bruce countered that this approach would not work on the two bitmap examples.
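
Bruce's objection can be illustrated with a short sketch. The pixel values are hypothetical, and SHA-256 is chosen here only as a representative cryptographic digest; the notes do not say which digest was proposed.

```python
import hashlib

# Hypothetical pixel values; the second array holds the same values permuted.
pixels_a = bytes([17, 3, 3, 17, 17, 16, 3, 17])
pixels_b = bytes(reversed(pixels_a))

digest_a = hashlib.sha256(pixels_a).hexdigest()
digest_b = hashlib.sha256(pixels_b).hexdigest()

digests_match = digest_a == digest_b                   # byte order matters
same_values = sorted(pixels_a) == sorted(pixels_b)     # order ignored
print(digests_match, same_values)
```

A digest confirms that two files are bit-for-bit identical, so any permutation of the values, or a changed palette in the header, produces a different digest even when the scientific content is unchanged.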

Summary

Summary - categorizing stable objects in an archive - see slides for more details. These things might be more useful to users than allowing inauthenticity to creep in. More work is needed on taxonomies to see if these algorithms would improve user search efficiency. Curt said the second point is really hard; you might have to specify a domain of discourse. Donald added that the annotation might need to reference which user community was the reference for the term.

Ruth mentioned two papers by Altman and Renear, where they talk about scientific equivalence in digital curation.

Curt mentioned provenance equivalence for a next step.

Next Steps

Bruce asked what the next step is. Ruth and Curt said there is a need for a concrete outcome to move forward. Donald said he is not sure which problem we are trying to solve in a broader sense. His organization is already addressing some of these things, but where are we trying to take the discussion outside of ESIP?

Curt said we needed use cases for citations. How do you know people are using the same data when gathered from different locations? Can an algorithm verify scientific equivalence? Data collections that change over time present additional complications. Identifiers for objects and collections present their own difficulties, including being able to go back to access collections at different dates and find out which objects and collections the archive had on those dates etc.

Bruce said that the group would continue to try to resolve the issues by e-mail. [At the business meeting for the Data Stewardship and Preservation Cluster on Friday morning, Bruce suggested that he would try to create two articles to be submitted to one or more journals. One paper would be on the object classification and accounting issues. The second would be on the issues raised by digital object replication and scientific equivalence. A likely publication venue would be the CODATA Data Science Journal, which apparently does not require page charges and allows authors to retain copyrights.]

Acknowledgements

Bruce appreciated Sarah Ramdeen's help in creating a useful set of notes during this session. He was also grateful for the help provided by Erin Robinson, her staff of helpers, and the hotel for providing equipment and other facilities for this discussion.