Michael Decker (MDecker)

From Earth Science Information Partners (ESIP)
Revision as of 10:14, August 23, 2011 by Michael Decker (MDecker)

-- Michael Decker (MDecker) 10:14, 23 August 2011 (MDT)

Data Servers: Technical Realization (IT) Issues and Solutions --- Summary

Issues re. the use of netCDF and other data formats

netCDF is a standard format for multi-dimensional data. CF-netCDF is used both as an archival format for gridded data and as a payload format for WCS queries.

  • Issue: ambiguity and completeness of CF
    • development of a server independent (python) CF-API library
    • Brainstorming: What is missing in CF?
    • Issue: CF (udunits) time format not the same as ISO Time format (as used by WCS)
      • those two cases can be processed with different code, but uniformity would be less confusing
      • could try to get ISO time recommendation into CF; would still need different code because it's only a recommendation
    • Issue: geo-referencing
  • Issue: We should agree on a standard NetCDF Python interface (PyNIO, python-netcdf4, scipy.io.netcdf?)
  • Issue: other output formats
    • support fused into server or add-on concept (possibly using the public W*S/NetCDF interface)
    • Delivery of (small) data sets in ASCII/csv format?
  • Issue: Reading other gridded input data formats? (e.g. GRIB)
  • Issue: traceability and revision tracking of datasets (in WCS metadata as well as in NetCDF metadata)
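
The CF versus ISO time mismatch noted above can be illustrated with a small conversion helper. This is only a sketch under stated assumptions: it uses the Python standard library, handles only the fixed-length units seconds, minutes, hours and days, assumes a standard Gregorian calendar and a UTC reference time, and the function names `parse_cf_units`, `cf_to_iso` and `iso_to_cf` are hypothetical, not part of any existing CF-API library.

```python
from datetime import datetime, timedelta, timezone

# Seconds per CF/udunits time unit handled by this sketch (fixed-length units only).
_UNIT_SECONDS = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}


def parse_cf_units(units):
    """Split a CF units string such as 'hours since 1990-01-01 00:00:00'."""
    unit, since, reference = units.split(None, 2)
    if since != "since" or unit not in _UNIT_SECONDS:
        raise ValueError("unsupported CF time units: %r" % units)
    # Assumption: the reference time is UTC, written as 'YYYY-MM-DD[ HH:MM:SS]'.
    ref = datetime.fromisoformat(reference.strip()).replace(tzinfo=timezone.utc)
    return _UNIT_SECONDS[unit], ref


def cf_to_iso(value, units):
    """Convert a numeric CF time value to an ISO 8601 UTC string (as used by WCS)."""
    scale, ref = parse_cf_units(units)
    return (ref + timedelta(seconds=value * scale)).strftime("%Y-%m-%dT%H:%M:%SZ")


def iso_to_cf(iso, units):
    """Convert an ISO 8601 UTC string back to a numeric CF time value."""
    scale, ref = parse_cf_units(units)
    moment = datetime.strptime(iso, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return (moment - ref).total_seconds() / scale
```

Keeping both directions in one small module would let the WCS layer work purely in ISO time while the NetCDF layer keeps the CF encoding.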

Server co-development tools, methods

Server code is maintained through SourceForge (bug tracker, tarballs); Darcs code repositories are available at WUSTL and in Juelich.

  • Issue: Version control
    • maintain a common codebase
  • Issue: Documentation
    • mainly inline tech documentation to date
    • need more documentation regarding operation
    • proposal: sphinx for proper documentation (easy to include inline tech doc)

Use of WMS, WCS, WFS .. in combination?

Data display/preview is through WMS. AQ data can be delivered through WCS and WFS. In AComServ, WCS is used for transferring n-dimensional grid and point-station data; WFS is used for delivering monitoring station descriptions.

  • Issue: WMS interface for preview; "latest" token for dynamic links?
    • generic WMS service operating on an external WCS
    • "latest" token could be realized on WCS-WMS interface by using the metadata on the WMS/client side and requesting the latest time; latest time could be default response of WMS server if nothing else requested
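
The "latest" token idea can be sketched as a small resolution step on the WMS/client side. The function name `resolve_time` and the shape of `available_times` are assumptions for illustration; the sketch also assumes all time stamps use the same ISO 8601 UTC layout, so that string order equals chronological order.

```python
def resolve_time(requested, available_times):
    """Return the ISO time stamp to request from the WCS.

    `available_times` is assumed to hold the time stamps taken from the
    coverage metadata, all formatted as 'YYYY-MM-DDTHH:MM:SSZ'.
    A missing time or the token 'latest' resolves to the newest time stamp.
    """
    if not available_times:
        raise ValueError("coverage advertises no time stamps")
    if requested is None or requested == "latest":
        return max(available_times)
    if requested not in available_times:
        raise ValueError("time %r not available" % requested)
    return requested
```

Treating a missing time parameter the same as "latest" corresponds to the suggestion that the latest time could be the default response.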

Gridded data service through WCS

WCS exists in multiple versions: 1.0, 1.1.2, 2.0. The AQ Community Server (AComServ) is now implemented using WCS 1.1.2. This generally works well.

  • Issue: serve "virtual" WCS datasets with continuous time line assembled from many source files
    • clients should only have to issue one query to receive the whole time series in one piece, instead of needing client-side logic to request multiple pieces
    • could create a "wrapper" module that can handle such cases with knowledge of the server-side file structure
      • Kari has already done something like this for HTAP datasets; this could be a starting point
  • Issue: desirable time filtering options in WCS: hour of day, day of week, day of month, etc.
  • Issue: Extraction of vertical levels
    • already defined through the rangesubset/fieldsubset parameter
    • is this definition OK for us or would we need something else/better?
      • potential problem: only enumeration of levels possible, no ranges
  • Issue: current state of WCS 2.0? core released, but extensions still in draft (how do we know/keep track of what is currently valid?)
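
The "virtual" dataset wrapper mentioned above needs, at minimum, a way to map one requested time range onto the source files that cover it. The following is a minimal sketch of that lookup only; the catalog layout (a list of `(first_time, last_time, path)` tuples) and the name `select_files` are hypothetical and do not describe the existing HTAP code.

```python
from datetime import datetime


def select_files(catalog, start, end):
    """Return the source files whose time span overlaps [start, end], in time order.

    `catalog` is a hypothetical list of (first_time, last_time, path) tuples
    describing the server-side file structure of one virtual coverage.
    """
    if start > end:
        raise ValueError("start must not be after end")
    # Two closed intervals overlap when each one starts before the other ends.
    hits = [entry for entry in catalog if entry[0] <= end and entry[1] >= start]
    return [path for _, _, path in sorted(hits)]
```

The server would then read the requested slice from each returned file and concatenate the pieces along the time axis before delivery.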

Delivery of station-point data

  • Issue: use WCS or WFS, Combination of both?

Access rights

  • Issue: technical options to restrict access to datasets?

Data server performance issues/solutions?

Define performance issues, measurements, ideas

  • Issue: especially big datasets take a long time to prepare for delivery (slicing/subsetting, etc.)
    • direct streaming of datasets to the client could be part of the solution
    • generated datasets could be cached for a while, so they could be delivered again when there is a request with compatible parameters
    • problem: both proposals might be mutually exclusive to some degree
  • Issue: management overhead when opening NetCDF
    • when opening a NetCDF file, some metadata has to be read and data structures have to be set up
      • input files could be kept open for a while to avoid this overhead
  • Issue: temp file space is limited on WCS server
    • streaming approach for store=false parameter would not require additional local storage
    • temp file approach for store=true parameter could be limited by a maximum dataset size
      • requires a reliable output file size estimator
      • server would return an exception if estimated size is over given threshold
      • would force people to use store=false for large datasets
      • should not violate WCS 1.1 standard (too badly) as only store=false is mandatory
  • Issue: XML metadata assembly might take a long time depending on the catalog content, e.g. with a large number of identifiers
    • GetCapabilities response Metadata is very static anyway, other responses (DescribeCoverage) could be cached for a while
      • attention: DescribeCoverage response depends on parameters
    • minor issue compared to actual data delivery performance
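
The size limit for store=true requests could be sketched as follows. All names here (`RequestTooLarge`, `estimate_bytes`, `check_store_request`) are hypothetical, and the estimate assumes uncompressed 4-byte values plus a fixed header allowance, which is a guess rather than a measured figure; a reliable estimator would need to account for the actual variable types and metadata.

```python
class RequestTooLarge(Exception):
    """Raised when a store=true request would exceed the temp file budget."""


def estimate_bytes(shape, bytes_per_value=4, header_bytes=4096):
    """Rough output size: header allowance plus one value per grid cell.

    Assumption: uncompressed output with a fixed-size value type.
    """
    count = 1
    for n in shape:
        count *= n
    return header_bytes + count * bytes_per_value


def check_store_request(shape, store, max_bytes):
    """Return the estimated size, or raise if a stored result would be too big.

    Streaming responses (store=false) need no local temp file and always pass.
    """
    size = estimate_bytes(shape)
    if store and size > max_bytes:
        raise RequestTooLarge(
            "estimated %d bytes exceeds limit of %d; use store=false" % (size, max_bytes)
        )
    return size
```

In a server, the raised exception would be translated into a WCS exception report telling the client to retry with store=false.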

Relationship to non-AComServ (non-NetCDF) WCS servers

  • data format(s)
    • most WCS clients don't understand NetCDF
  • Issue: protocol compatibility
    • might need to implement more optional features of WCS
  • standard compliance
    • will need a test suite for 1.1.2 (and manage to run it)

Linkages to non-WCS servers

  • Issue: is there a need?
  • Issue: which protocols? (OpenDAP?, GIS servers?)


-- Michael Decker (MDecker) 10:03, 23 August 2011 (MDT)

CF-API:

Performance/Virtual Datasets

  • non-compressed data preferred
  • many files vs. single file for queries
    • mapping: many files -> single identifier
      • Kari: might be too slow
      • Michael: should not matter so much for performance
      • queries might get very large
      • need to limit query size on server side (datafed browser: client side management currently)

Common NetCDF Python Interface, NetCDF4

  • Kari cloned PyNIO interface for Windows, so no problem right now for cross platform development
  • solve other problems first, keep an eye open
  • NetCDF4 makes things more complicated, might not be mappable to WCS easily

Delivery of other data formats, other input formats

  • need to map other formats to WCS and/or CF concept
  • differentiate between format (NetCDF) and convention (CF)
  • chain with WMS server for default views/previews

Traceability and revision tracking of Datasets

  • always try to get current data when dealing with real-time data; always expect your data to be old
  • would be nice to have WCS field for "last updated" date, same for NetCDF/CF (global attribute?)
    • we can make something up on our own for a start
    • try to propose that for CF (and WCS)

Delivery of Point Station data

  • put config into SQL database as much as possible (views, stored procedures, etc)
    • try to maintain unit tests for this

Access restrictions to WCS

  • HTTP Basic authentication
  • API key
  • does not have to be 100% secure, more about connecting with the users, knowing who they are
  • firewalling for small user groups
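
Both lightweight options listed above, HTTP Basic authentication and an API key, can be sketched with the Python standard library. The function names and the `known_keys` mapping are assumptions for illustration; in line with the notes, the goal is identifying users rather than strong security, and a real deployment would more likely rely on the web server's own authentication support.

```python
import base64
import hmac


def parse_basic_auth(header):
    """Return (user, password) from an 'Authorization: Basic ...' value, or None."""
    scheme, _, token = header.partition(" ")
    if scheme.lower() != "basic":
        return None
    try:
        decoded = base64.b64decode(token.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return None
    user, sep, password = decoded.partition(":")
    return (user, password) if sep else None


def identify_by_key(api_key, known_keys):
    """Return the user name registered for `api_key`, or None if it is unknown.

    `known_keys` is a hypothetical mapping of user name -> issued API key.
    """
    for user, key in known_keys.items():
        # compare_digest avoids leaking key contents through timing differences.
        if hmac.compare_digest(key.encode("utf-8"), api_key.encode("utf-8")):
            return user
    return None
```

Either function returns enough information to log who requested which dataset, which is the stated purpose of the access restriction.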

Relationship with other Servers

  • write a wrapper for other data formats

WCS 2.0

  • more modular, core and extensions
  • potentially easier to use/implement
  • CF-NetCDF extension coming

Processing Services

Time filtering

  • day of week, hour of day, day of month,...
  • describe non-standard features in capabilities document?
  • might be difficult to get into official standard?
  • does not interfere with standard if you don't use it
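
The proposed non-standard time filters could be applied on the server after the standard time-range selection. A minimal sketch, assuming the time axis is already available as Python `datetime` objects; the function name `filter_times` and its keyword parameters are hypothetical and not part of any WCS version.

```python
from datetime import datetime


def filter_times(times, hours=None, weekdays=None, days_of_month=None):
    """Keep only the datetimes that match every filter that was given.

    hours:         allowed hours of day (0-23)
    weekdays:      allowed days of week (Monday=0 ... Sunday=6)
    days_of_month: allowed days of month (1-31)
    A filter left as None is not applied.
    """
    def keep(t):
        return ((hours is None or t.hour in hours)
                and (weekdays is None or t.weekday() in weekdays)
                and (days_of_month is None or t.day in days_of_month))

    return [t for t in times if keep(t)]
```

Because every filter is optional, a request that uses none of them behaves exactly like a standard query, which matches the point that the extension does not interfere with the standard.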