Questions and Comments about CF-1.6 Station Data Convention

From Earth Science Information Partners (ESIP)
Revision as of 06:02, August 31, 2011 by Hoijarvi

The objective of this page is to promote discussion about CF-netCDF formats for station point data.

CSV, Comma Separated Values: Its Uses and Limitations

CSV format is easy, compact and practical for many uses. It can be fed to any spreadsheet and consumed easily with custom software.

Example of WCS GetCoverage query returning an envelope with uri to the result CSV file

   loc_code,lat,lon,datetime,pm25nr, etc...
   350610008,34.81,-106.74,2007-05-12T00:00:00,3.8, etc...
   350439004,35.62,-106.72,2007-05-12T00:00:00,20.9, etc...
   350610008,34.81,-106.74,2007-05-12T01:00:00,6.9, etc...

Unfortunately, the format makes embedding metadata in a CF-like convention difficult.

  • Location dimension:
    • Incompleteness: nothing indicates what else is known about the stations beyond loc_code, lat and lon.
    • Stations with no data may be present with a NULL value, or may not be present at all. Nothing indicates which is the case.
    • Repetition of the same latitude and longitude values.
  • Time dimension:
    • What's the periodicity? The software needs to guess that the sample query coverage AQS_H actually is hourly data.
    • Do all the locations share the same periodicity, or do the locations have individual recording times? Again, a guess.
    • What were the requested time min and max, and what is the time range actually returned?
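
The guesswork described above can be illustrated with a minimal Python sketch. It parses the three sample rows (with the elided etc... columns left out), collects the distinct stations, and infers the time step as the smallest gap between distinct timestamps. The helper names are hypothetical, and the inferred step is only a guess: nothing in the CSV confirms it.

```python
import csv
import io
from datetime import datetime

# The sample rows from above, without the elided "etc..." columns.
SAMPLE_CSV = """loc_code,lat,lon,datetime,pm25nr
350610008,34.81,-106.74,2007-05-12T00:00:00,3.8
350439004,35.62,-106.72,2007-05-12T00:00:00,20.9
350610008,34.81,-106.74,2007-05-12T01:00:00,6.9
"""

def read_rows(text):
    """Parse the CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def distinct_stations(rows):
    """Collapse the repeated lat/lon values into one entry per loc_code."""
    return {r["loc_code"]: (float(r["lat"]), float(r["lon"])) for r in rows}

def guess_time_step(rows):
    """Guess the periodicity as the smallest gap between distinct timestamps.

    Returns None when fewer than two distinct times are present.
    """
    times = sorted({datetime.fromisoformat(r["datetime"]) for r in rows})
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return min(gaps) if gaps else None

rows = read_rows(SAMPLE_CSV)
stations = distinct_stations(rows)
step = guess_time_step(rows)
print(len(stations), "stations, guessed step:", step)
```

With only three rows the guess happens to come out as one hour, but a station that skipped a measurement, or a query that returned a single time step, would defeat it.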

There's a field pm25nr, but the CSV cannot say more about it: it is PM 2.5 Non-Reference Method, the units are ug/m3, and the source is the EPA Air Quality Network.

CF-netCDF to the rescue.

Existing CF-netCDF Conventions

At Datafed, we supported station data before CF-1.5 and WCS 1.1.0 were official. This was implemented in the older WCS 1.0.0 standard and has not been incorporated into the WCS 1.1 service yet.

Sample Query to AQS_H returning CF-netCDF station time series

Same as above, but returning only the CDL header file without data

The above implementation still lacks what we really need, so let's use it just as a proof of concept: it is possible to pack station data into CF-netCDF, and the result is more expressive than CSV. It was also created from an obsolete CF-1.5 draft, so it's incompatible with the real CF-1.5 document.

CF 1.5 and Station Data

The section 5.4. Timeseries of Station Data shows a more recently accepted coding of station data.

That section shows almost what datafed is producing. Simplifying the data by removing the pressure dimension:

   dimensions:
       station = 10 ;  // measurement locations
       time = UNLIMITED ;
   variables:
       float humidity(time,station) ;
           humidity:long_name = "specific humidity" ;
           humidity:coordinates = "lat lon" ;
       double time(time) ;
           time:long_name = "time of measurement" ;
           time:units = "days since 1970-01-01 00:00:00" ;
       float lon(station) ;
           lon:long_name = "station longitude";
           lon:units = "degrees_east";
       float lat(station) ;
           lat:long_name = "station latitude" ;
           lat:units = "degrees_north" ;

Properties of the data:

  • Stations:
    • Stations are indexed along station dimension.
    • Each station has lat and lon.
  • Time:
    • Measurement time is enumerated and is consistent across the locations.
    • Time is the unlimited dimension, so new measurements can be appended.
  • Data
    • Only the data is stored, very efficient transport mode for densely packed data.
    • Times and stations are not stored physically with each data point, since they can be computed from the index.
    • Missing data could be indicated by adding missing_value attribute.
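
The point that times and stations are computed from the index can be sketched in Python. The snippet assumes a flat, row-major array laid out as humidity(time, station), as in the CDL above; the function name and all sample values are hypothetical.

```python
N_STATIONS = 3  # size of the station dimension (10 in the CDL above; 3 here for brevity)

# Coordinate variables, stored once each (hypothetical values).
time = [0.0, 1.0]                 # days since 1970-01-01
lat = [34.81, 35.62, 36.10]
lon = [-106.74, -106.72, -105.90]

# humidity(time, station) flattened in row-major order: station varies fastest.
humidity = [0.010, 0.012, 0.011,
            0.009, 0.013, 0.010]

def locate(flat_index, n_stations=N_STATIONS):
    """Return (time_index, station_index) for a position in the flat array."""
    return divmod(flat_index, n_stations)

t, s = locate(4)
print(time[t], lat[s], lon[s], humidity[4])
```

Only the six humidity values travel with the data; the time and position of each one follow from its position in the array.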

CF 1.6 Draft and Station Data

The section H.2. Time Series Data shows a more recently accepted coding of station data.

   dimensions:
       station = 10 ;  // measurement locations
       name_strlen = 15 ; // location name
       time = UNLIMITED ;
   variables:
       float humidity(station,time) ;
           humidity:standard_name = "specific humidity" ;
           humidity:coordinates = "lat lon alt" ;
       double time(time) ; 
           time:standard_name = "time";
           time:long_name = "time of measurement" ;
           time:units = "days since 1970-01-01 00:00:00" ;
       float lon(station) ; 
           lon:standard_name = "longitude";
           lon:long_name = "station longitude";
           lon:units = "degrees_east";
       float lat(station) ; 
           lat:standard_name = "latitude";
           lat:long_name = "station latitude" ;
           lat:units = "degrees_north" ; 
       float alt(station) ;
           alt:long_name = "vertical distance above the surface" ;
           alt:standard_name = "height" ;
           alt:units = "m";
           alt:positive = "up";
           alt:axis = "Z";
       char station_name(station, name_strlen) ;
           station_name:long_name = "station name" ;
           station_name:cf_role = "timeseries_id";
   attributes:
       :featureType = "timeSeries";


Properties of the data:

  • Stations:
    • Same as above, with addition
    • Each station has station_name.
    • Each station has alt for altitude with units and direction.
  • Time:
    • Same as above, with standard_name attribute
  • Data
    • Same as above, with standard_name attribute

CF-netCDF Data Model extension specification (Draft)

This is an OGC draft paper edited by Ben Domenico and Stefano Nativi. It differs a little from CF-1.5 and CF-1.6.

It contains a similar timeSeries convention: a series of data points at the same spatial location with monotonically increasing times.

The data has the same two dimensions as above: station and time. But time also has the same two dimensions. This means that every station may have its own time values: the time for the observation at index 34 may change from station to station.

The example given is not consistent: the time variable has only one dimension.

   dimensions:
       time = UNLIMITED; // (5 currently)
       station = 10;
       nv = 2;
   variables:
       float pressure(time,station);
           pressure:long_name = "pressure";
           pressure:units = "kPa";
           pressure:cell_methods = "time: point";
       float maxtemp(time,station);
           maxtemp:long_name = "temperature";
           maxtemp:units = "K";
           maxtemp:cell_methods = "time: maximum";
       float ppn(time,station);
           ppn:long_name = "depth of water-equivalent precipitation";
           ppn:units = "mm";
           ppn:cell_methods = "time: sum";
       double time(time);
           time:long_name = "time";
           time:units = "h since 1998-4-19 6:0:0";
           time:bounds = "time_bnds";
       double time_bnds(time,nv);

Requirements Suggested by Datafed

All of the below are suggestions that are already in use at Datafed. Comments are welcome.

The station table should have any number of fields

The station (location) table at datafed has

  • Minimum 3 fields: loc_code, lat, lon
  • Optional standard fields: loc_name, elev
  • Any number of extended fields: state_code, county_code, etc...

This could be achieved easily with CF 1.6 encoding.

The problem is that the client needs to do some guesswork to figure out the structure of the table. The station_name:cf_role = "timeseries_id"; attribute tells that this is the primary key field in the station table, and humidity:coordinates = "lat lon alt" ; tells the rest of the fields.

Personally, I think this convention is confusing and not extensible. It would be better to code the table directly in a relational manner. Here's an example without the time or data variables and irrelevant properties:

   dimensions:
       station = 10 ;  // measurement locations
       code_strlen = 15 ; // station code
       name_strlen = 56 ; // station name
       time = UNLIMITED ;
   variables:
       float lon(station) ; 
           ... omitted
           lon:table_name = "station";
       float lat(station) ; 
           ... omitted
           lat:table_name = "station";
       float alt(station) ;
           ... omitted
           alt:table_name = "station";
       char station_code(station, code_strlen) ;
           ... omitted
           station_code:table_name = "station";
           station_code:primary_key = "T";
       char station_name(station, name_strlen) ;
           ... omitted
           station_name:table_name = "station";
   ... omitted

The station table has five fields: station_code, station_name, lon, lat and alt.

Each field is marked with table_name = "station". Now it is possible to just find all the variables with that attribute.

The station_code is marked with primary_key = "T". For example, the CIRA/VIEWS database has codes like ACAD and names like 'Acadia National Park'.

The CF-1.6 attribute coordinates = "lat lon alt" is a little confusing. The question here is whether lat, lon and alt are dimensions or attributes of a dimension. In gridded data, lat, lon and alt are dimensions. In station data, they are attributes of a station.
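
Finding the table by its attributes can be sketched in a few lines of Python. This is only an illustration of the proposed table_name and primary_key attributes: the variables dict is a hypothetical in-memory stand-in for what a client would read from the file, and the function name is made up.

```python
# Hypothetical in-memory stand-in for the variables of a netCDF file:
# variable name -> attribute dict. A real client would read these from the file.
variables = {
    "lon": {"table_name": "station"},
    "lat": {"table_name": "station"},
    "alt": {"table_name": "station"},
    "station_code": {"table_name": "station", "primary_key": "T"},
    "station_name": {"table_name": "station"},
    "time": {},  # no table_name attribute, so not part of any table here
}

def tables_from_attributes(variables):
    """Group variable names by table_name and collect each table's key columns."""
    tables = {}
    for name, attrs in variables.items():
        table = attrs.get("table_name")
        if table is None:
            continue
        entry = tables.setdefault(table, {"columns": [], "primary_key": []})
        entry["columns"].append(name)
        if attrs.get("primary_key") == "T":
            entry["primary_key"].append(name)
    return tables

tables = tables_from_attributes(variables)
print(tables["station"])
```

No knowledge of cf_role or coordinates is needed: the client discovers the station table, its five columns and its key from the attributes alone.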

Fields should be Mandatory, Optional, and User-defined

  • The convention must define at least three fields for the station table: station_code, lat and lon.
  • Since most networks have the same metadata, there needs to be a convention for optional columns:
    • station_name
    • altitude or depth (already exists in CF)
  • Providers should be able to add any data columns to the station table, for example country, county_code, state, start_date, end_date.

Mandatory columns are needed so that client software can draw the data on the map. Optional columns enable the clients to display the station name with the code, when necessary. User-defined columns can be displayed, used as a filter, or just ignored.
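
A client could sort the columns of a station table into these three classes along the following lines. This is a minimal Python sketch under assumptions: the column-name sets simply restate the lists above (with altitude and depth taken as literal names), and the function is hypothetical rather than part of any convention.

```python
MANDATORY = {"station_code", "lat", "lon"}
OPTIONAL = {"station_name", "altitude", "depth"}

def classify_columns(columns):
    """Split station-table column names into the three proposed classes.

    Raises ValueError if a mandatory column is missing.
    """
    present = set(columns)
    missing = MANDATORY - present
    if missing:
        raise ValueError("missing mandatory columns: " + ", ".join(sorted(missing)))
    return {
        "mandatory": sorted(present & MANDATORY),
        "optional": sorted(present & OPTIONAL),
        "user_defined": sorted(present - MANDATORY - OPTIONAL),
    }

result = classify_columns(["station_code", "lat", "lon", "station_name", "county_code"])
print(result)
```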

Multiple Tables Should be Allowed

This encoding allows us to store any number of tables in a single netCDF file, enabling future extensions. For example, the CF-1.5 trajectory encoding only allows one trajectory. Datafed uses an internal format for trajectories, with multiple stations each having an individual trajectory. Such an encoding is a trivial relational schema, and the above table definition can be used as is.

Encoding tables should be totally separate from anything else. This would enable generic software to translate 1:1 between any relational store and netCDF. After that, experimental CF conventions could be tested using existing table encodings. There are three levels of these standards:

  • Format: NetCDF and HDF are both formats. We're discussing NetCDF, but HDF is very similar and could also be used as the format.
  • Convention: CF is a convention.
  • Schema: A single table of station data, as in CF-1.6, does not need a schema. But if we want to encode time series of trajectories from multiple stations, we need a schema that standardizes how stations, trajectories, and trajectory points are encoded.

CF-1.6 combines both convention and schema together, when they should be separated.

Encoding Relational Tables in NetCDF

Encoding Relational Tables to NetCDF

  • There must be a dimension for the table, to enumerate the records.
  • A column is a variable with that dimension and a table_name attribute.
  • A table is made of the columns with the same table_name attribute.
  • Primary keys are declared with the primary_key = "T" attribute. A primary key can consist of multiple columns.
  • Foreign keys are declared either by using the dimension of another table, or by declaring a field to be a reference to another table.

Examples:

Here's how you encode a minimal station table, with all the CF attributes omitted:

   dimensions:
       station = 10 ;  // station table dimension
       code_strlen = 4 ; // artificial dimension for station code, with max four character length.
   variables:
       float lon(station) ; 
           lon:table_name = "station";
       float lat(station) ; 
           lat:table_name = "station";
       char station_code(station, code_strlen) ;
           station_code:table_name = "station";
           station_code:primary_key = "T";

The stations now have indices 0-9.

Since we have data with a time dimension, let's add the time table, again omitting CF attributes like units = "days since 1970-01-01 00:00:00":

   dimensions:
       ... as above
       time = UNLIMITED ;
   variables:
       ... as above
       double time(time) ; 
           time:table_name = "time";
           time:primary_key = "T";

Time can be indexed just as station, 0..(count-1).

So far we have two unrelated tables for our two dimensions: location and time. Now let's add humidity and temperature, which refer to both of the dimensions:

   dimensions:
       ... as above
       time = UNLIMITED ;
   variables:
       ... as above
       float humidity(time, station) ;
           humidity:table_name = "data";
       float temperature(time, station) ;
           temperature:table_name = "data";

This is identical to CF-1.6 encoding. Both variables are 2-dimensional cubes.

The humidity and temperature are now both in the table named data, which has four fields: time, station, humidity and temperature.

This is a very compact encoding if every station has data for every time step, but wasteful if data is sparse, since the cubes will then contain mainly null values. That's why CF drafts have had encodings like ragged arrays.

With multiple tables, there's an easy way to encode sparse tables. Instead of the above, we'll describe the record dimension:

   dimensions:
       ... as above
       data = 800 ; // 800 records
   variables:
       ... as above
       int data_station(data) ;
           data_station:table_name = "data";
           data_station:foreign_key = "station";
       int data_time(data) ;
           data_time:table_name = "data";
           data_time:foreign_key = "time";
       float humidity(data) ;
           humidity:table_name = "data";
       float temperature(data) ;
           temperature:table_name = "data";

The table still has the four columns, but the encoding is not a cube anymore. Time and station are no longer netCDF dimensions of the data; instead, variables like data_station contain an index into the station table.

To print out row 5 of the data table:

  • read data_station(5). It may contain the number 8, which refers to station 8 in the station table.
    • read station_code(8,all), lat(8) and lon(8)
  • read data_time(5). It may contain the number 112, which refers to time 112 in the time table.
    • read time(112), which contains the number 35523, meaning 35523 days from 1970-01-01
  • read humidity(5), which represents itself.
  • read temperature(5)
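
The lookup steps above can be sketched in Python with plain lists standing in for the netCDF variables. All table contents are hypothetical and sized only so that the indices used in the steps (row 5, station 8, time 112) exist; read_row is a made-up helper name.

```python
# Hypothetical table contents, sized so that the indices used in the steps above exist.
station_code = ["S%03d" % i for i in range(10)]     # station_code(station, code_strlen)
lat = [30.0 + i for i in range(10)]                 # lat(station)
lon = [-110.0 + i for i in range(10)]               # lon(station)
time = [35411 + i for i in range(113)]              # time(time), days since 1970-01-01

# Sparse "data" table: one entry per record.
data_station = [0, 1, 2, 3, 4, 8]                   # foreign key into the station table
data_time = [0, 0, 1, 1, 2, 112]                    # foreign key into the time table
humidity = [0.010, 0.011, 0.012, 0.009, 0.010, 0.013]
temperature = [280.0, 281.5, 279.0, 282.0, 283.5, 284.0]

def read_row(i):
    """Resolve record i of the data table by following both foreign keys."""
    s = data_station[i]
    t = data_time[i]
    return {
        "station_code": station_code[s],
        "lat": lat[s],
        "lon": lon[s],
        "time_days": time[t],
        "humidity": humidity[i],
        "temperature": temperature[i],
    }

row = read_row(5)
print(row)
```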

This is a simple and powerful encoding.

  • Any typical schema from SQL-databases can be encoded.
  • If the data is dense, foreign keys can be expressed as dimensions, resulting in vastly smaller files than flat CSV would be.
  • If the data is sparse, foreign keys can be expressed as variables.
  • There can be any combination of the two, enabling the best of both worlds whatever your data will look like.
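
The trade-off between the two forms can be made concrete with a small Python sketch that converts a dense cube into the sparse record form. The cube contents are hypothetical, None stands in for a missing_value marker, and the cell count is only a rough proxy for file size.

```python
MISSING = None  # stand-in for a missing_value marker

# Dense cube humidity(time, station): 3 time steps x 4 stations, mostly empty.
cube = [
    [0.010, MISSING, MISSING, MISSING],
    [MISSING, MISSING, 0.012, MISSING],
    [MISSING, MISSING, MISSING, MISSING],
]

def cube_to_records(cube):
    """Convert a dense (time, station) cube into sparse record columns.

    Returns three parallel lists: time index, station index, and value.
    """
    data_time, data_station, values = [], [], []
    for t, row in enumerate(cube):
        for s, value in enumerate(row):
            if value is not MISSING:
                data_time.append(t)
                data_station.append(s)
                values.append(value)
    return data_time, data_station, values

data_time, data_station, humidity = cube_to_records(cube)
dense_cells = sum(len(row) for row in cube)
sparse_cells = 3 * len(humidity)  # two key columns plus one value column
print(dense_cells, "cells dense vs", sparse_cells, "cells sparse")
```

Each sparse record costs three cells instead of one, so under this rough count the record form only pays off when fewer than about a third of the cube's cells hold data.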

Encoding Relational Schemas to NetCDF

A generic client can now read tables and their relations, but unless more knowledge is available, there's no way to know what the data actually means. The table encoding gives syntax, but not semantics. The WCS conventions need to come up with standard schemas that standardize how metadata is encoded.

The example here is CIRA/VIEWS, where the schema is a typical star schema and each measurement refers to several metadata tables:

  • Parameter
  • Aggregation
  • Method
  • DataSource
  • Program
  • Site

The encoding of such tables is straightforward, using CIRA/VIEWS nomenclature:

   dimensions:
       // dimensions for max name lengths
       site_code_length = 4 ;
       site_name_length = 50 ;
       program_code_length = 8 ;
       etc...
       // dimensions for metadata tables
       time = 800 ;
       site = 80 ;
       aggregation = 3 ; // just a few aggregation methods are available
       program = 5 ; // similarly, the program table is small
       etc...
       // main data dimension:
       data = UNLIMITED ;


   variables:
       // variables for metadata tables
       char site_code(site, site_code_length) ;
           site_code:table_name = "site";
           site_code:primary_key = "T";
       float lon(site) ; 
           lon:table_name = "site";
       float lat(site) ; 
           lat:table_name = "site";
       about 10 more fields for site...
       char program_code(program, program_code_length) ;
           program_code:table_name = "program";
       char program_name(program, program_name_length) ;
           program_name:table_name = "program";
       five more fields....
       six more tables...
       float AirFact3_AirFact(data) ;  // data variable
           AirFact3_AirFact:table_name = "AirFact3";
       int AirFact3_Parameter(data) ;  // refers to parameter table (Alf, SO4f, MT, Mf etc...)
           AirFact3_Parameter:table_name = "AirFact3";
       int AirFact3_Site(data) ;  // refers to Site
           AirFact3_Site:table_name = "AirFact3";
       int AirFact3_Time(data) ;  // refers to Time
           AirFact3_Time:table_name = "AirFact3";
       int AirFact3_Program(data) ;  // refers to Program
           AirFact3_Program:table_name = "AirFact3";

Time dimension should have periodicity encoded

todo

History should be more than a single comment

todo