Talk:Provenance and Context Content Standard

Ctilmes 11:46, 3 March 2011 (MST)

Taken from Rama's email: http://rtpnet.org/pipermail/esip-preserve/2011/000428.html

Clynnes 06:01, 7 March 2011 (MST)

Clynnes: I would suggest you add a column noting possible restrictions on later distribution of items. This would include source code release restrictions, as well as ITAR restrictions on most documents describing spacecraft details, for example.

Siri-Jodha Singh Khalsa 14:19, 8 March 2011 (MST)

From email. Some questions and comments:

1. I didn't see a mention of the data and metadata themselves. Would this archive package be separate from the data? The division between documentation and metadata is blurred, of course, e.g. production history is included in the table, but is in some cases saved as metadata. Some QA is metadata, some is in the product files. The spec needs to address how data and metadata are preserved.

2. Normalization of the entries is needed - there's lots of duplication/overlap, and some similar or identical items appear under different categories.

3. Some descriptions refer to "project artifacts" - the reference should be to specific items in the table. Likewise, what are "developer's design documents"?

4. There needs to be a separate column for "alternate" sources for an item, rather than using the rationale column, which should eventually be populated for each item.

5. Aren't items that refer to accessing the data, like NOAA item 36, irrelevant? The physical location of the data may change.

6. Error analysis under documentation belongs under Quality (and I think QA is too narrow a term; it should be just "quality" or "quality measures"). Validation belongs under Quality also, don't you think?

7. Why source code only for higher-level products? Why not all source code involved in product generation?

Bruce Barkstrom 05:16, 9 March 2011 (MST)

[email]

I'll probably have some longer comments later, but here's a bit more on the context of the NASA project environment that may help explain the tabulation. First, the concern that John Moses has been wrestling with is what to do with items at the end of a government project. The instrument developer is usually a contractor and the documentation items would need to be specifically identified as deliverables in the contract. They may consist of documents or collections of data, which might be in the form of files or in databases developed by the contractor. It may be that the full collections are provided to the government institution that oversees the contract. The spreadsheet is intended to assist first in dealing with items in the contracts that will be disposed of if no action is taken very soon. It is not clear that the relevant institutions or organizations that deal with the metadata and data produced by the project have any knowledge of this contractually deliverable information. Indeed, the term 'metadata' and the finer distinction between documents, data, and contextual information we usually call metadata may not have any meaning to the instrument project or the instrument contractor.

In my experience (where the dialect I'm accustomed to may not be equivalent to that of other communities), QA and QC would usually refer to automated elements of the production software that generate reports (usually files with text and numerical values) routinely as part of each file's production job. The number of these can be quite large - as might be expected when NASA EOS missions run 5,000 to 100,000 production jobs per day over ten to twenty years. It is not clear that the science teams that provide the software or the EOS data centers have ever made any plans to archive these reports. QA and QC might also include the diagnostic work and production changes associated with discovering anomalies in the reports and developing fixes. While these anomalies and fixes are likely to appear in action item lists prepared by the production teams, these items are also likely to be discarded or excluded from the archive accession lists for permanent retention.
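To give a feel for the scale, here is a rough back-of-the-envelope count. It is only a sketch: it assumes one report per production job and 365 production days per year, neither of which is stated above, and the function name is made up for illustration.

```python
# Rough count of QA/QC reports over a mission lifetime.
# Assumptions (not from the discussion): one report per job,
# production runs 365 days per year.
DAYS_PER_YEAR = 365


def total_reports(jobs_per_day: int, years: int, reports_per_job: int = 1) -> int:
    """Return the number of reports produced over the given number of years."""
    return jobs_per_day * DAYS_PER_YEAR * years * reports_per_job


low = total_reports(5_000, 10)      # low job rate, short mission
high = total_reports(100_000, 20)   # high job rate, long mission
print(f"{low:,} to {high:,} reports")
```

Under those assumptions the range runs from tens of millions to hundreds of millions of report files per mission.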

'Validation' and 'Calibration' also have many variant meanings. In my experience, calibration refers to the activities and procedures that are used to convert the raw data into geophysical units. For example, a conversion from digital counts to calibrated radiances would use calibration coefficients - as well as an algorithm embedded in source code that may also perform statistical filtering to remove "bad values". The calibration data for the instruments I'm familiar with may come from pre-flight facilities on the ground, or from in-flight sources that are part of the instrument. To complicate the discussion further, some kinds of calibration algorithms use fairly complex radiative transfer models that use information on surface reflection and atmospheric constituent distributions to model the radiances arriving at the instrument in orbit. In all three of these calibration methods, there are complex chains of processing and fairly large amounts of data and documentation that may be involved.
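As a minimal sketch of the counts-to-radiance step described above: the linear gain/offset model, the median-based filter, and all names and numbers here are assumptions chosen for illustration, not the algorithm of any particular instrument.

```python
from statistics import median


def counts_to_radiance(counts, gain, offset):
    """Apply a (hypothetical) linear calibration: radiance = gain * count + offset."""
    return [gain * c + offset for c in counts]


def drop_bad_values(values, max_dev=5.0):
    """Drop values farther than max_dev median absolute deviations from the median.

    This stands in for the 'statistical filtering' mentioned above; real
    production code may use a different test.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)  # no spread to judge against; keep everything
    return [v for v in values if abs(v - med) <= max_dev * mad]


raw_counts = [100, 102, 98, 101, 99, 4000]   # last sample is a spike
radiances = counts_to_radiance(raw_counts, gain=0.5, offset=-10.0)
clean = drop_bad_values(radiances)
print(clean)
```

Even in this toy form, preserving the result means preserving three things together: the coefficients, the source code that applies them, and the filtering rule.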

In contrast, validation is more likely to involve the activities, processes, and data used to compare geophysical parameters from different measurement sources, and is highly likely to involve the more complex processes that convert the calibrated values (of the previous paragraph) to other kinds of geophysical quantities. For the experiments with which I've been involved, calibrated radiances are the lowest physically useful kind of data. Most of the validation involved processes that attempted to compare fluxes from our instruments with fluxes from other instruments (radiance being a quantity that describes energy transport along a light ray, while a flux describes energy being transported through an area for all directions), or with determinations of cloud cover or properties. In addition, validation often involves areal and temporal averages - say going from 20 km diameter footprints to 2.5 degree regions and from instantaneous observations to monthly averages. Again, the processes, data, and documentation for these validation exercises may be voluminous and may take up as much storage as the data itself.
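The areal and temporal averaging can be sketched roughly as follows. This is an illustration only: the record layout, the function names, and the simple unweighted mean are assumptions, and a real validation exercise would likely weight footprints by area or overlap.

```python
import math
from collections import defaultdict

CELL_DEG = 2.5  # region size in degrees, as in the example above


def cell_index(lat, lon):
    """Map a footprint center to a (row, col) index on a 2.5-degree grid."""
    row = min(math.floor((lat + 90.0) / CELL_DEG), int(180 / CELL_DEG) - 1)
    col = math.floor((lon % 360.0) / CELL_DEG)
    return row, col


def monthly_regional_means(observations):
    """Average instantaneous footprint values per (year, month, row, col).

    observations: iterable of (year, month, lat, lon, value) tuples
    (a hypothetical record layout).
    """
    sums = defaultdict(lambda: [0.0, 0])
    for year, month, lat, lon, value in observations:
        acc = sums[(year, month, *cell_index(lat, lon))]
        acc[0] += value
        acc[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


obs = [
    (2000, 3, 10.1, 20.2, 240.0),  # two footprints in the same region
    (2000, 3, 11.0, 21.0, 260.0),
    (2000, 3, 40.0, 20.2, 300.0),  # one footprint in a different region
]
means = monthly_regional_means(obs)
print(means)
```

The gridding rule and the averaging choice both shape the validated product, so they belong with the processes and documentation that would need to be kept.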

Of course, these comments reflect my experience with a couple of large projects. Other communities may have different mental models and different vocabularies for describing them.

As for the source code question, the governmental and project context may govern the answer. In some projects, particularly at NOAA, the data reduction software may be written by government contractors - and it may be difficult (not to mention expensive) to get the contractor to agree to formally release the software for more public use. In other cases (NASA EOS being one), the data production software may have been required to be made available as part of the agreement by the producers that create the data and the documentation. Source code is difficult enough to deal with (some of the code bases are substantial fractions of a million lines of code - or even more). We'd also have to get the procedures and production histories that may (or may not) have been part of the contractual agreements. What used to be NPOESS may or may not have included these items in the contract, for example. As to higher level products vs lower level code, we'd have to see whether the contractor regarded the code as proprietary. I do recall having a great deal of difficulty getting the source code out of the ESDIS contractor to deal with the geolocation algorithms that used DEMs.

Anyway, hope these comments are useful in understanding the context. It isn't easy to provide all of the context when you're trying to engage in an information rescue mission, as John Moses is right now.