From Federation of Earth Science Information Partners
-- Brbarkstrom 07:39, 6 April 2011 (MDT)
Notes on Provenance
Here are two notes on provenance from a more formal point of view:
- Production History Provenance for Discrete Production
- If production is discrete, meaning that it follows a batch processing paradigm in which jobs ingest input files and create output files, then the history of production can be captured in a production history provenance graph. The nodes in the graph are either files or job residues. A job residue is the condensed record of a transient activity and includes such information as the start time, the end time, and the machine configuration. For the production graph to make sense as a record of production, it needs to be a Directed Acyclic Graph (DAG), meaning that no loops are allowed. As a corollary, each vertex in the graph must have a unique identifier: if the identifiers were not unique, two distinct vertices could be conflated, loops would become possible, and you could not distinguish an input file from an output file.
- The requirement that production form a DAG has some implications for databases. In a sense, a database is like a file: one must open either kind of entity to use it and close it when finished. The data structures in databases are more rigidly typed than those in files. The key difference lies in the use of Structured Query Language (SQL) and related access mechanisms to create the result sets that databases output. By design, databases are also more careful about preserving the ACID properties (atomicity, consistency, isolation, durability) by controlling how their contents are updated. This design means that updates are intended to look like instantaneous events that operate on the internal state of the database. In other words, the state of the database is different after a transaction than it was before. To determine the state of a database, we can therefore view that state as evolving through a sequence of transactions. Thus, to uniquely identify a state of the database, one could use an identifier like a UUID, or one could simply number the states with a sequence of integers, where each transaction increments the integer. This approach would make it easy to build an auditable history of transactions that could be used for forensic purposes. Note also that the sequence of integers is equivalent to a time stamp, albeit one from a clock with jitter, because transactions do not occur at a uniform, enforced rate.
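The two notes above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not a description of any existing production system: the class names `ProvenanceGraph` and `VersionedDatabase`, the `name@sequence` identifier format, and the file, job, and host names are all hypothetical. The sketch also reflects one reading of why the DAG requirement matters for databases: if a job both reads and updates a database, a single "database" vertex would put a loop in the graph, whereas numbering the database states by transaction keeps the graph acyclic.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Directed graph whose vertices are files, job residues, or database states."""

    def __init__(self):
        self.info = {}                  # vertex id -> descriptive record
        self.edges = defaultdict(set)   # vertex id -> ids derived from it

    def add_vertex(self, vid, kind, **details):
        # Identifiers must be unique, otherwise distinct vertices would merge.
        if vid in self.info:
            raise ValueError(f"duplicate identifier: {vid}")
        self.info[vid] = {"kind": kind, **details}

    def add_edge(self, src, dst):
        if src not in self.info or dst not in self.info:
            raise KeyError("both endpoints must be registered first")
        self.edges[src].add(dst)

    def is_acyclic(self):
        # Kahn's algorithm: keep removing vertices that have no incoming edges.
        indegree = {v: 0 for v in self.info}
        for targets in self.edges.values():
            for t in targets:
                indegree[t] += 1
        ready = [v for v, n in indegree.items() if n == 0]
        removed = 0
        while ready:
            v = ready.pop()
            removed += 1
            for t in self.edges.get(v, ()):
                indegree[t] -= 1
                if indegree[t] == 0:
                    ready.append(t)
        return removed == len(self.info)


class VersionedDatabase:
    """Names each database state by an integer that every transaction increments."""

    def __init__(self, name):
        self.name = name
        self.sequence = 0

    def state_id(self):
        return f"{self.name}@{self.sequence}"

    def commit(self):
        self.sequence += 1
        return self.state_id()


# Versioned case: the job reads state 0 and its transaction produces state 1.
db = VersionedDatabase("catalog")
versioned = ProvenanceGraph()
versioned.add_vertex("raw.dat", "file")
versioned.add_vertex(db.state_id(), "db_state")
versioned.add_vertex("job-1", "job_residue", start="2011-04-06T07:00:00",
                     end="2011-04-06T07:05:00", machine="hypothetical-host")
versioned.add_edge("raw.dat", "job-1")
versioned.add_edge(db.state_id(), "job-1")
versioned.add_vertex("product.dat", "file")
versioned.add_edge("job-1", "product.dat")
new_state = db.commit()
versioned.add_vertex(new_state, "db_state")
versioned.add_edge("job-1", new_state)

# Unversioned case: one vertex stands for the database before and after the job.
unversioned = ProvenanceGraph()
unversioned.add_vertex("catalog", "db")
unversioned.add_vertex("job-1", "job_residue")
unversioned.add_edge("catalog", "job-1")
unversioned.add_edge("job-1", "catalog")

print(versioned.is_acyclic(), unversioned.is_acyclic())  # True False
```

In the versioned graph, `catalog@0` and `catalog@1` are separate vertices, so the job residue sits cleanly between its inputs and its outputs. The integer in each state identifier doubles as the position of that state in the transaction history, which is the auditable sequence the second note describes.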