Preservation Use Case Reproducing a dataset

From Earth Science Information Partners (ESIP)

Reproducing a dataset

Summary

An instrument PI has developed an instrument to capture FOO data and has created a dataset from that data (see Use Case XXX). The dataset has been validated (see UC XXX) and published through an archive (see UC XXX). A researcher may or may not be able to download the original data content, binary, source code, etc. (see Subcases), but wants to re-create the original dataset before doing something else with it. Possible motivations include comparing it with their own dataset, verifying a published science paper, preparing to introduce changes into the software, supporting IPCC AR16, or disaster recovery.

Sometimes the input files themselves are not available, so this use case can be exercised recursively. For example, one can reproduce level-3 data from level-2, or one might instead go back to level-1b or level-0 and regenerate each intermediate input dataset step by step.
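The recursion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PROVENANCE` table, the `reproduce` function, and the `run` callback are hypothetical names introduced here, and a real provenance record would list specific input files rather than whole processing levels.

```python
from typing import Callable, Dict, List, Set

# Hypothetical provenance table: each dataset maps to the inputs it is made from.
# The level names follow the text; the structure itself is an assumption.
PROVENANCE: Dict[str, List[str]] = {
    "level-3": ["level-2"],
    "level-2": ["level-1b"],
    "level-1b": ["level-0"],
    "level-0": [],
}


def reproduce(
    dataset: str,
    available: Set[str],
    run: Callable[[str, List[str]], None],
) -> List[str]:
    """Return the datasets that had to be regenerated, in order, to obtain `dataset`."""
    if dataset in available:
        return []  # obtainable as-is, nothing to regenerate
    inputs = PROVENANCE[dataset]
    if not inputs:
        raise LookupError(f"{dataset} has no inputs and is not available")
    rebuilt: List[str] = []
    for inp in inputs:
        # Exercise the same use case on each missing input.
        rebuilt += reproduce(inp, available, run)
    run(dataset, inputs)    # hypothetical hook that executes the processing code
    available.add(dataset)  # the regenerated dataset can now serve as an input
    rebuilt.append(dataset)
    return rebuilt
```

For example, if only level-1b is obtainable, `reproduce("level-3", {"level-1b"}, run)` regenerates level-2 first and then level-3.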

Subcases

  • Data availability
    • Data content available
    • Data content not available
  • Software availability
    • Source code available
    • Source code not available, design documents available
    • Binary available and capable of running on an accessible platform

Actors

  • Researcher
  • Archive

Sequence of Events

  1. Obtain dataset provenance information
    1. List of input files
    2. Runtime parameters
  2. Obtain needed inputs
    1. Data files
    2. DAP (Delivered Algorithm Package)
  3. Obtain binary capable of producing dataset
    1. Available
    2. Not available
      1. Source code available
        1. Need to know how to build it: environment, compilers, compiler options, libraries
        2. Compile source code
        3. Verify that the build is correct
          1. Re-run the unit test case (e.g. MODIS golden month)
      2. Source code not available
        1. Re-implement the source code from design documents or ATBD (Algorithm Theoretical Basis Document)
          1. Continue as in "Source code available" above
  4. Obtain framework for running the code and making the dataset
    1. Available?
    2. Collaboratory?
    3. Port code into your own framework?
  5. Run the code to produce the dataset
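The branching in step 3 (obtaining a binary) can be written out as a small planning function. This is a minimal sketch, not a prescribed procedure: `Availability` and `plan_binary_steps` are hypothetical names, and the step descriptions are paraphrased from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Availability:
    """What the researcher was able to obtain (hypothetical structure)."""
    binary: bool = False       # binary that runs on an accessible platform
    source: bool = False       # source code
    design_docs: bool = False  # design documents or ATBD


def plan_binary_steps(a: Availability) -> List[str]:
    """Return the steps needed to end up with a binary that can make the dataset."""
    if a.binary:
        return ["use existing binary"]
    steps: List[str] = []
    if not a.source:
        if not a.design_docs:
            raise ValueError("no binary, source code, or design documents available")
        # Without source code, it must first be re-created from the documents.
        steps.append("re-implement source code from design documents or ATBD")
    # From here on, the path is the same as the "source code available" branch.
    steps += [
        "re-create build environment (compilers, compiler options, libraries)",
        "compile source code",
        "re-run unit test case and compare with reference output",
    ]
    return steps
```

Calling `plan_binary_steps(Availability(design_docs=True))` returns four steps, starting with the re-implementation, which mirrors how the "source code not available" branch rejoins the "source code available" branch in the list.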

PCCS Artifacts:

  1. Provenance description of how dataset was made
    1. List of input data files
      1. Level-? data
      2. Ancillary data
    2. List of runtime parameters
  2. Input data files
  3. Execution environment description
    1. OS version
    2. Hardware description
    3. System library versions
  4. DAP
    1. Executables
    2. Source Code
    3. Build environment description
      1. OS version
      2. Compiler version
      3. Library versions
      4. Unit test data
  5. Algorithm design documents / ATBD
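One way to make this artifact list checkable is to hold it in a record and report which entries are still empty. The sketch below is an assumption made for illustration, not a PCCS-defined schema: the `ArtifactRecord` class, its field names, and the `missing_artifacts` helper are all hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class ArtifactRecord:
    """Hypothetical container mirroring the artifact list above."""
    input_files: List[str] = field(default_factory=list)                 # level and ancillary data
    runtime_parameters: Dict[str, str] = field(default_factory=dict)
    execution_environment: Dict[str, str] = field(default_factory=dict)  # OS, hardware, system libraries
    executables: List[str] = field(default_factory=list)
    source_code: List[str] = field(default_factory=list)
    build_environment: Dict[str, str] = field(default_factory=dict)      # OS, compiler, libraries
    unit_test_data: List[str] = field(default_factory=list)
    design_documents: List[str] = field(default_factory=list)            # design documents / ATBD


def missing_artifacts(record: ArtifactRecord) -> List[str]:
    """Return the names of artifact categories with no entries, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]
```

An archive or researcher could run `missing_artifacts` before attempting a reproduction to see which subcase applies. An empty `executables` list together with a populated `source_code` list, for instance, points to the "source code available" branch.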