Questions and Comments about CF-1.6 Station Data Convention
The objective of this page is to promote discussion about CF-netCDF formats for station point data and beyond. After commenting on the existing standards, best practices and drafts, we propose that the convention for representing station data should be defined on three levels:
At the top level, there is a CF-convention. Currently CF-conventions are mainly for gridded data but we have more needs:
- Station point data (already CF-1.5 and up)
- Station trajectory data
- Aeroplane data
- Lidar data
The middle level is a standard for encoding relational schemas. This would let us experiment with ideas for how to encode the data and metadata. Popular schemas could then be standardized.
The bottom level is the data format, such as netCDF or HDF. We discuss only netCDF here, but HDF is very similar and could also be used as the format.
The rationale for adding the relational schema encoding is that it can be reused later. A single table of station data, as CF-1.6 describes it, does not need much of a schema. But if we want to encode time series of trajectories from multiple stations, we need a schema that standardizes how stations, trajectories, and trajectory points are encoded. With the relational schema encoding, it is possible to create a library that operates at the table level, hiding the encoding from the application programmer.
The middle level is documented here: Encoding Relational Tables in NetCDF
CSV, Comma Separated Values: Its Uses and Limitations
CSV format is easy, compact and practical for many uses. It can be fed to any spreadsheet and consumed easily with custom software.
loc_code,lat,lon,datetime,pm25nr, etc...
350610008,34.81,-106.74,2007-05-12T00:00:00,3.8, etc...
350439004,35.62,-106.72,2007-05-12T00:00:00,20.9, etc...
350610008,34.81,-106.74,2007-05-12T01:00:00,6.9, etc...
Unfortunately, the format makes embedding metadata in a CF-like convention difficult.
- Location dimension:
- Incompleteness: There is no indication of what else is known about the stations beyond loc_code, lat and lon.
- Stations with no data may be present with a NULL value, or may not be present at all. Nothing indicates which is the case.
- Repetition of the same latitude and longitude values.
- Time dimension:
- What is the periodicity? The software needs to guess that the sample query coverage AQS_H is actually hourly data.
- Do all the locations share the same periodicity, or do the locations have individual recording times? Again, a guess.
- What were the requested minimum and maximum times, and what is the time range actually returned?
The field pm25nr appears with no description, although more is known about it: it is PM 2.5 Non-Reference Method, its units are ug/m3, and its source is the EPA Air Quality Network.
To address these shortcomings, we look at CF-netCDF.
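The repetition and the guesswork described above can be made concrete with a short Python sketch. It splits the three sample rows into a station table and an observation table, then guesses the periodicity from the smallest gap between timestamps. The function names `split_tables` and `guess_period` are hypothetical, and the elided "etc..." columns are left out because their content is not known.

```python
import csv
import io
from datetime import datetime

# The three sample rows from above, without the elided "etc..." columns.
SAMPLE = """loc_code,lat,lon,datetime,pm25nr
350610008,34.81,-106.74,2007-05-12T00:00:00,3.8
350439004,35.62,-106.72,2007-05-12T00:00:00,20.9
350610008,34.81,-106.74,2007-05-12T01:00:00,6.9
"""


def split_tables(text):
    """Split flat CSV rows into a station table and an observation table."""
    stations = {}      # loc_code -> (lat, lon), stored once per station
    observations = []  # (loc_code, time, value)
    for row in csv.DictReader(io.StringIO(text)):
        code = row["loc_code"]
        stations[code] = (float(row["lat"]), float(row["lon"]))
        when = datetime.fromisoformat(row["datetime"])
        observations.append((code, when, float(row["pm25nr"])))
    return stations, observations


def guess_period(observations):
    """Guess the sampling period as the smallest gap between distinct times."""
    times = sorted({when for _, when, _ in observations})
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return min(gaps) if gaps else None


stations, observations = split_tables(SAMPLE)
period = guess_period(observations)
print(len(stations), "stations,", len(observations), "observations, period", period)
```

The period found this way is only a guess: nothing in the CSV itself states that the data is hourly, and a file with a single timestamp gives no answer at all.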
Station Data at Datafed with CF 1.5 draft
At Datafed, we supported station data before CF-1.5 and WCS 1.1.0 were official. This was implemented in the older WCS 1.0.0 standard and has not been incorporated into the WCS 1.1 service yet.
The above implementation still lacks what we really need, so let's treat it only as a proof of concept: it is possible to pack station data into CF-netCDF, and the result is more expressive than CSV. It was also created from an obsolete CF-1.5 draft, so it is incompatible with the official CF-1.5 document.
CF 1.5 and Station Data
The section 5.4. Timeseries of Station Data shows a more recently accepted coding of station data.
Table names in CDL are omitted for clarity.
That section shows almost exactly what Datafed produces. Simplifying the data by removing the pressure dimension gives:
dimensions:
  station = 10 ; // measurement locations
  time = UNLIMITED ;
variables:
  float humidity(time,station) ;
    long_name = "specific humidity" ;
    coordinates = "lat lon" ;
  double time(time) ;
    long_name = "time of measurement" ;
    units = "days since 1970-01-01 00:00:00" ;
  float lon(station) ;
    long_name = "station longitude";
    units = "degrees_east";
  float lat(station) ;
    long_name = "station latitude" ;
    units = "degrees_north" ;
Properties of the data:
- Stations are indexed along the station dimension.
- Each station has lat and lon.
- Measurement time is enumerated and is consistent across the locations.
- Time is the unlimited dimension, so new measurements can be appended.
- Only the data is stored, which makes this a very efficient transport mode for densely packed data.
- Times and stations are not stored physically with each data point, since they can be computed from the index.
- Missing data could be indicated by adding a missing_value attribute.
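How time and station are computed from the index can be sketched in Python. The sketch assumes the humidity(time,station) layout shown above, stored row-major so that the station index varies fastest. All values, the `MISSING` sentinel and the function names are hypothetical.

```python
# Hypothetical values; only the shapes follow the CDL above.
station_lat = [34.81, 35.62, 40.0]
station_lon = [-106.74, -106.72, -105.0]
time_days = [13645.0, 13645.5]  # "days since 1970-01-01 00:00:00"
MISSING = -999.0                # assumed value of a missing_value attribute

# humidity(time, station), stored row-major as one flat list:
# the station index varies fastest.
humidity = [0.010, 0.012, MISSING,
            0.011, 0.013, 0.009]


def locate(flat_index, n_station):
    """Recover (time index, station index) from a position in the flat array."""
    return divmod(flat_index, n_station)


def records(values, times, lats, lons, missing):
    """Expand the packed array into explicit (time, lat, lon, value) records."""
    out = []
    for i, value in enumerate(values):
        if value == missing:
            continue  # skip cells flagged as missing
        t, s = locate(i, len(lats))
        out.append((times[t], lats[s], lons[s], value))
    return out


rows = records(humidity, time_days, station_lat, station_lon, MISSING)
print(len(rows), "records from", len(humidity), "cells")
```

The expanded records repeat the time, latitude and longitude for every value, which is exactly the repetition the packed layout avoids.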
CF 1.6 Draft and Station Data
The section H.2. Time Series Data shows a more recently accepted coding of station data.
dimensions:
  station = 10 ; // measurement locations
  name_strlen = 15 ; // location name
  time = UNLIMITED ;
variables:
  float humidity(station,time) ;
    standard_name = "specific humidity" ;
    coordinates = "lat lon alt" ;
  double time(time) ;
    standard_name = "time";
    long_name = "time of measurement" ;
    units = "days since 1970-01-01 00:00:00" ;
  float lon(station) ;
    standard_name = "longitude";
    long_name = "station longitude";
    units = "degrees_east";
  float lat(station) ;
    standard_name = "latitude";
    long_name = "station latitude" ;
    units = "degrees_north" ;
  float alt(station) ;
    long_name = "vertical distance above the surface" ;
    standard_name = "height" ;
    units = "m";
    positive = "up";
    axis = "Z";
  char station_name(station, name_strlen) ;
    long_name = "station name" ;
    cf_role = "timeseries_id";
attributes:
  :featureType = "timeSeries";
Properties of the data:
- Same as above, with these additions:
- Each station has station_name.
- Each station has alt for altitude, with units and direction.
- The time, lat and lon variables are the same as above, with a standard_name attribute added.
CF-netCDF Data Model extension specification (Draft)
This is an OGC draft paper edited by Ben Domenico and Stefano Nativi. It differs a little from CF-1.5 and 1.6.
It contains a similar timeSeries convention: a series of data points at the same spatial location with monotonically increasing times.
The data has the same two dimensions as above: station and time. But time also has the same two dimensions. This means that every station may have its own time values: the time for the observation at index 34 may change from station to station.
The example given is not consistent: the time variable has only one dimension in the CDL example, but two dimensions in table 1, page 33.
dimensions:
  time = UNLIMITED; // (5 currently)
  station = 10;
  nv = 2;
variables:
  float pressure(time,station);
    long_name = "pressure";
    units = "kPa";
    cell_methods = "time: point";
  float maxtemp(time,station);
    long_name = "temperature";
    units = "K";
    cell_methods = "time: maximum";
  float ppn(time,station);
    long_name = "depth of water-equivalent precipitation";
    units = "mm";
    cell_methods = "time: sum";
  double time(time);
    long_name = "time";
    units = "h since 1998-4-19 6:0:0";
    bounds = "time_bnds";
  double time_bnds(time,nv);
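The difference between a one-dimensional and a two-dimensional time variable can be illustrated with a small Python sketch. It assumes the two-dimensional form is indexed as time(time,station); the draft's table does not settle the order here, so that is an assumption, as are all the values and function names.

```python
# One-dimensional time(time): every station shares the same time values.
SHARED_TIME = [0.0, 1.0, 2.0]

# Two-dimensional time, assumed here to be indexed (time, station):
# each station has its own value for the same observation index.
PER_STATION_TIME = [
    [0.0, 0.25],
    [1.0, 1.25],
    [2.0, 2.25],
]


def time_of(time_var, obs_index, station_index):
    """Look up an observation time in either a 1-D or a 2-D time variable."""
    entry = time_var[obs_index]
    if isinstance(entry, list):  # 2-D: the value depends on the station
        return entry[station_index]
    return entry                 # 1-D: the same value for every station


def is_increasing_per_station(time_var, n_station):
    """Check that each station's times are strictly increasing."""
    for s in range(n_station):
        series = [time_of(time_var, i, s) for i in range(len(time_var))]
        if any(later <= earlier for earlier, later in zip(series, series[1:])):
            return False
    return True


print(time_of(SHARED_TIME, 1, 0), time_of(SHARED_TIME, 1, 1))
print(time_of(PER_STATION_TIME, 1, 0), time_of(PER_STATION_TIME, 1, 1))
```

A client that handles both forms through one lookup like `time_of` does not need to know in advance which variant of the convention a file follows.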
Requirements Suggested by Datafed
All of the below are suggestions that are already in use at Datafed. Comments are welcome.
The station table should have any number of fields
The station (location) table at Datafed has:
- Minimum 3 fields: station_code, lat, lon
- Optional standard fields: station_name, alt
- Any number of extended fields: state_code, county_code, etc...
This could be achieved easily with CF 1.6 encoding.
The problem is that the client needs some guesswork to figure out the structure of the table. The station_name:cf_role = "timeseries_id"; attribute indicates that this is the primary key field of the station table, and humidity:coordinates = "lat lon alt" ; lists the remaining fields. The coordinates attribute is a little confusing here: the question is whether lat, lon and alt are dimensions or attributes of a dimension. In gridded data, lat, lon and alt are dimensions. In station data, they are attributes of a station.
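The guesswork can be sketched in Python as follows. The variables are transcribed by hand from the CF-1.6 draft CDL into a plain dictionary; a real client would read them from the file. The function name `guess_station_table` and the dictionary layout are hypothetical, and the heuristic is only one possible reading of the attributes.

```python
# Variables as {name: (dimensions, attributes)}, transcribed by hand from
# the CF-1.6 draft CDL. Only the attributes used below are included.
VARIABLES = {
    "humidity": (("station", "time"), {"coordinates": "lat lon alt"}),
    "time": (("time",), {"standard_name": "time"}),
    "lon": (("station",), {"standard_name": "longitude"}),
    "lat": (("station",), {"standard_name": "latitude"}),
    "alt": (("station",), {"standard_name": "height"}),
    "station_name": (("station", "name_strlen"), {"cf_role": "timeseries_id"}),
}


def guess_station_table(variables):
    """Guess the key field and the remaining fields of the station table."""
    key = station_dim = None
    for name, (dims, attrs) in variables.items():
        if attrs.get("cf_role") == "timeseries_id":
            key, station_dim = name, dims[0]
            break
    if key is None:
        raise ValueError("no variable with cf_role = timeseries_id")

    # Collect the names listed in any coordinates attribute, keeping only
    # those defined along the station dimension alone.
    fields = []
    for _dims, attrs in variables.values():
        for candidate in attrs.get("coordinates", "").split():
            on_station = variables.get(candidate, ((), {}))[0] == (station_dim,)
            if on_station and candidate not in fields:
                fields.append(candidate)
    return key, fields


key, fields = guess_station_table(VARIABLES)
print(key, fields)
```

Any extended field that no data variable lists in its coordinates attribute would be missed by this heuristic, which illustrates why an explicit table definition would be easier for clients.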
Fields should be Mandatory, Optional, and User-defined
- The convention must define at least three fields for the station table: station_code, lat and lon.
- Since most networks have the same metadata, there needs to be a convention for optional columns:
- altitude or depth (already exists in CF)
- Providers should be able to add any data columns into the station table. For example country, county_code, state, start_date, end_date
Mandatory columns are needed so that client software can draw the data on the map. Optional columns enable the clients to display the station name with the code, when necessary. User-defined columns can be displayed, used as a filter, or just ignored.
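A client could sort the columns of a station table into these three groups along the lines of the Python sketch below. The mandatory and optional names follow the Datafed station table described earlier; the function name `classify_columns` and the returned dictionary layout are hypothetical.

```python
# Column names taken from the Datafed station table described above.
MANDATORY = ("station_code", "lat", "lon")
OPTIONAL = ("station_name", "alt")


def classify_columns(columns):
    """Group station table columns into mandatory, optional and user-defined."""
    missing = [name for name in MANDATORY if name not in columns]
    if missing:
        raise ValueError("missing mandatory station columns: " + ", ".join(missing))
    return {
        "mandatory": [c for c in columns if c in MANDATORY],
        "optional": [c for c in columns if c in OPTIONAL],
        "user_defined": [
            c for c in columns if c not in MANDATORY and c not in OPTIONAL
        ],
    }


layout = classify_columns(
    ["station_code", "lat", "lon", "station_name", "state_code", "county_code"]
)
print(layout)
```

A table without one of the mandatory columns is rejected, since the client could not place the station on the map.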
Multiple Tables Should be Allowed
Personally, my main worry is that this station data convention is not extensible. Many data providers would like to add additional metadata tables for parameters, stations, organizations, instruments, etc. Trajectories and lidar need their own convention, which cannot easily be extended from this one.
Therefore, we suggest Encoding Relational Tables in NetCDF as a mid-level convention.
This encoding allows us to store any number of tables in a single netCDF file, enabling future extensions. For example, the CF-1.5 trajectory encoding only allows one trajectory. Datafed is using an internal format for trajectories, with multiple stations each having its own trajectory. Such an encoding is a trivial relational schema, and the above table definition can be used as is.
Encoding tables should be totally separate from anything else. This would enable generic software to translate 1:1 between any relational store and netCDF. After that, experimental CF conventions could be tested using existing table encodings. These standards form the three levels described at the top of this page.
CF-1.6 combines schema and table encoding; they should be separated.
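As an illustration of how small such a schema is, here is a Python sketch with three tables linked by keys. It is not Datafed's internal format, which is not described here: the class names, field names and trajectory values are all hypothetical, and only the station codes and coordinates are taken from the CSV sample.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Station:
    station_code: str  # primary key
    lat: float
    lon: float


@dataclass(frozen=True)
class Trajectory:
    trajectory_id: int  # primary key
    station_code: str   # foreign key to Station.station_code
    start_time: str


@dataclass(frozen=True)
class TrajectoryPoint:
    trajectory_id: int  # foreign key to Trajectory.trajectory_id
    hours_offset: float
    lat: float
    lon: float


stations = [
    Station("350610008", 34.81, -106.74),
    Station("350439004", 35.62, -106.72),
]
trajectories = [
    Trajectory(1, "350610008", "2007-05-12T00:00:00"),
    Trajectory(2, "350439004", "2007-05-12T00:00:00"),
]
points = [
    TrajectoryPoint(1, -1.0, 34.9, -106.9),
    TrajectoryPoint(1, -2.0, 35.0, -107.1),
    TrajectoryPoint(2, -1.0, 35.7, -106.8),
]


def points_for_station(station_code, trajectories, points):
    """Join the tables: all trajectory points that belong to one station."""
    ids = {t.trajectory_id for t in trajectories if t.station_code == station_code}
    return [p for p in points if p.trajectory_id in ids]


print(len(points_for_station("350610008", trajectories, points)))
```

Each table maps naturally onto one netCDF dimension with one variable per field, which is why a generic table encoding would cover this case without a trajectory-specific convention.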