From Earth Science Information Partners (ESIP)

Big Data Science Use Case: spatio-temporal data in the Earth sciences

Science Driver:

In the Earth sciences, spatio-temporal data arrive not only in high Volume and at high Velocity, but above all in high Variety: regularly and irregularly gridded data, TINs, meshes; 1-D sensor timeseries, 2-D satellite images, 3-D x/y/t image timeseries and x/y/z geological models, 4-D x/y/z/t climate simulation output, etc.

A useful abstraction in this context is the notion of a coverage as defined in ISO 19123 (identical to OGC Abstract Specification Topic 6, which is freely available for download). A coverage can be seen as the digital representation of a phenomenon that varies over space and/or time, such as remote sensing imagery or climate data. While this definition is too abstract to be interoperable, the OGC GML 3.2.1 Application Schema - Coverages (nicknamed GMLCOV) establishes a concrete, interoperable coverage data structure which can be encoded in GML, GeoTIFF, NetCDF, JPEG2000, and any other suitable data format. This coverage data definition is independent of any particular service and can be used in WCS, WMS, SOS, WCPS, WPS, SWE, etc.
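To make the coverage idea more tangible, here is a minimal sketch of a regular-grid coverage as a data structure. The three parts (domain set, range type, range set) follow the general GMLCOV layout; the class names, field names, axis labels, and units are hypothetical simplifications chosen for illustration and are not taken from the standard's schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Axis:
    """One axis of the domain: a label, its bounds, and its number of grid points."""
    label: str
    low: float
    high: float
    size: int


@dataclass
class Coverage:
    """Toy regular-grid coverage (illustrative only, not the GMLCOV schema)."""
    domain_set: List[Axis]               # where: the space/time grid
    range_type: Dict[str, str]           # what: field name -> unit of measure
    range_set: List[Tuple[float, ...]]   # values: one tuple per grid point

    def dimension(self) -> int:
        """Number of axes, e.g. 2 for an image, 3 for an image timeseries."""
        return len(self.domain_set)

    def is_consistent(self) -> bool:
        """True if there is one value tuple per grid point and one entry per field."""
        expected = 1
        for axis in self.domain_set:
            expected *= axis.size
        return (len(self.range_set) == expected
                and all(len(v) == len(self.range_type) for v in self.range_set))


# A 2 x 3 image with two bands (hypothetical axis labels, band names, and units).
cov = Coverage(
    domain_set=[Axis("Lat", 50.0, 51.0, 2), Axis("Long", 8.0, 10.0, 3)],
    range_type={"red": "W/m2", "nir": "W/m2"},
    range_set=[(0.1, 0.5)] * 6,
)
```

Keeping the grid description, the meaning of the values, and the values themselves separate is what allows the same coverage to be serialized into different formats without changing its definition.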

The OGC Web Coverage Service (WCS) suite establishes a modular set of services, ranging from simple subsetting (WCS Core) up to ad-hoc querying and processing of multi-dimensional raster coverages (Web Coverage Processing Service, WCPS).
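As an illustration of WCS Core subsetting, the following sketch assembles a WCS 2.0 GetCoverage request in key-value-pair form. The endpoint URL, the coverage identifier, and the axis labels (Lat, Long, ansi) are assumptions made for the example; real axis names depend on the coverage and should be read from the server's DescribeCoverage response.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def get_coverage_url(endpoint, coverage_id, subsets, fmt="image/tiff"):
    """Build a WCS 2.0 KVP GetCoverage URL.

    `subsets` maps an axis label to either a (low, high) tuple, which
    requests a trim, or a single value, which requests a slice.
    """
    params = [
        ("service", "WCS"),
        ("version", "2.0.1"),
        ("request", "GetCoverage"),
        ("coverageId", coverage_id),
        ("format", fmt),
    ]
    for axis, bounds in subsets.items():
        if isinstance(bounds, tuple):
            low, high = bounds
            params.append(("subset", f"{axis}({low},{high})"))  # trim
        else:
            params.append(("subset", f"{axis}({bounds})"))      # slice
    return endpoint + "?" + urlencode(params)


# Hypothetical server and coverage: trim in Lat/Long, slice at one time step.
url = get_coverage_url(
    "https://example.org/ows",
    "MyImageTimeseries",
    {"Lat": (10, 20), "Long": (30, 40), "ansi": '"2010-01-01"'},
)
query = parse_qs(urlparse(url).query)  # decode again to inspect the parameters
```

The `subset` parameter is repeated once per axis, which is why the parameters are kept as a list of pairs rather than a dictionary.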

In the EarthServer initiative, several key issues are addressed to get to grips with such data. Six Lighthouse Applications are establishing 100+ TB services whose interfaces rely strictly on OGC standards. The server platform is the rasdaman Array DBMS.

Data Characteristics

  • Data Set Names: n.a.
  • Volume = from a few 100 MB to multi-PB archives
  • Data Type = [Array, Tabular, Text etc]
  • Heterogeneity:
    • regular grids, irregular grids, and mixes thereof (ortho image timeseries!)
    • grids and meshes
    • 1..5D
    • spectral bands
    • different formats, metadata richness, etc.

  • Format:
    • CSV, GML, etc. for 1-D sensor timeseries
    • GeoTIFF, JPEG2000, PNG, etc. for 2-D imagery
    • NetCDF, HDF5, GRIB2, etc. for 4-D, 5-D climate data
    • name it...

Analysis Needs

  • subsetting (trimming a 2D map, slicing a 3D image timeseries, ...)
  • processing ("NDVI derived from some hyperspectral scene", "Fourier transform of some ocean image")
  • filtering ("which data set is of interest?")
  • combining (e.g., sensor streams with long-tail data)
  • from ad-hoc access to long-term archive retrieval ("long tail")
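
The difference between trimming and slicing, and a simple per-pixel processing step such as NDVI, can be sketched on a small synthetic 3-D image timeseries. This is a plain-Python illustration only; the array layout cube[t][y][x], the (red, nir) band order, the values, and the function names are all assumptions, and a real service would do this server-side on far larger arrays.

```python
# Synthetic cube[t][y][x] -> (red, nir): 3 time steps, 4 rows, 5 columns.
cube = [[[(0.1 * (t + 1), 0.5) for x in range(5)] for y in range(4)]
        for t in range(3)]


def trim(cube, t_range, y_range, x_range):
    """Trimming narrows the extent along each axis but keeps all three dimensions."""
    (t0, t1), (y0, y1), (x0, x1) = t_range, y_range, x_range
    return [[row[x0:x1] for row in image[y0:y1]] for image in cube[t0:t1]]


def slice_at_time(cube, t):
    """Slicing fixes one coordinate and thereby drops that dimension (3-D -> 2-D)."""
    return cube[t]


def ndvi(image):
    """NDVI = (nir - red) / (nir + red) per pixel; assumes nir + red is never zero."""
    return [[(nir - red) / (nir + red) for (red, nir) in row] for row in image]


trimmed = trim(cube, (0, 2), (1, 3), (0, 4))   # still 3-D, smaller extent
first_image = slice_at_time(cube, 0)           # 2-D image at the first time step
ndvi_image = ndvi(first_image)                 # derived single-band 2-D image
```

Trimming and slicing only select data, whereas NDVI derives new values from existing bands, which is the kind of step the processing item above refers to.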