Difference between revisions of "Encoding Relational Tables in NetCDF"

From Earth Science Information Partners (ESIP)
(Created page with "[[Questions_and_Comments_about_CF-1.6_Station_Data_Convention#CF-netCDF_Data_Model_extension_specification_.28Draft.29| Back to Questions and Comments about CF-1.6 Station Data C...")
 
Line 1: Line 1:
 
[[Questions_and_Comments_about_CF-1.6_Station_Data_Convention#CF-netCDF_Data_Model_extension_specification_.28Draft.29| Back to Questions and Comments about CF-1.6 Station Data Convention]]
 
[[Questions_and_Comments_about_CF-1.6_Station_Data_Convention#CF-netCDF_Data_Model_extension_specification_.28Draft.29| Back to Questions and Comments about CF-1.6 Station Data Convention]]
 +
 +
== Rational Behind this Proposal ==
 +
 +
The WCS service has potential to grow beyond regular, gridded coverages. Coverage types that need to be described are for example:
 +
 +
* station point data
 +
* station trajectory data
 +
* lidar data
 +
* moving station data, like aeroplane or ship
 +
 +
A collection of relational tables
 +
 +
== Encoding Relational Tables to NetCDF ==
 +
 +
* There must be a dimension for the table, to enumerate the records.
 +
* A column is a variable with that dimension and an attribute ''table_name''
 +
* A table is made of columns with the same ''table_name'' attribute.
 +
* Primary keys are declared with primary_key = "T" attribute. A table can have multiple keys as a primary key
 +
* Foreign keys are declared with either using the dimension of another table, or declaring a field to be a reference to another table.
 +
 +
Examples:
 +
 +
Here's how you encode a minimal station table, with all the CF attributes omitted:
 +
 +
    dimensions:
 +
        station = 10 ;  // station table dimension
 +
        code_strlen = 4 ; // artificial dimension for station code, with max four character length.
 +
    variables:
 +
        float lon(station) ;
 +
            lon:table_name = "station";
 +
        float lat(station) ;
 +
            lat:table_name = "station";
 +
        char station_code(station, code_strlen) ;
 +
            station_code:table_name = "station";
 +
            station_code:primary_key = "T";
 +
 +
The stations have now index 0-9.
 +
 +
Since we have data with time dimension, let's add the time table, again omitting the CF attributes like ''units = "days since 1970-01-01 00:00:00"''
 +
 +
    dimensions:
 +
        ... as above
 +
        time = UNLIMITED ;
 +
    variables:
 +
        ... as above
 +
        double time(time) ;
 +
            station_code:table_name = "time";
 +
            station_code:primary_key = "T";
 +
 +
Time can be indexed just as station, 0..(count-1).
 +
 +
So far we have the two unrelated tables for our two dimensions: location and time. Now let's add humidity and temperature, which refers to both of the dimensions:
 +
 +
    dimensions:
 +
        ... as above
 +
        time = UNLIMITED ;
 +
    variables:
 +
        ... as above
 +
        float humidity(time, station) ;
 +
            station_code:table_name = "data";
 +
        float temperature(time, station) ;
 +
            station_code:table_name = "data";
 +
 +
This is identical to CF-1.6 encoding. Both variables are 2-dimensional cubes.
 +
 +
The humidity and temperature are now both in the table named ''data'', which has four fields: time, station, humidity and temperature
 +
 +
This is a very compact encoding if every station has data for every time step, but wasteful if data is sparse, since the cubes will then contain mainly null values. That's why CF drafts have had encodings like ragged arrays.
 +
 +
With multiple tables there's an easy way to encode sparse tables. Instead of above, we'll describe the record dimension:
 +
 +
    dimensions:
 +
        ... as above
 +
        data = 800 ; // 800 records
 +
    variables:
 +
        ... as above
 +
        int data_station(data) ;
 +
            data_station:table_name = "data";
 +
            data_station:foreign_key = "station";
 +
        int data_time(data) ;
 +
            data_station:table_name = "data";
 +
            data_station:foreign_key = "time";
 +
        float humidity(data) ;
 +
            humidity:table_name = "data";
 +
        float temperature(data) ;
 +
            temperature:table_name = "data";
 +
 +
The table still has the four columns, but the encoding is not a cube anymore. The time and station dimensions are not netcdf dimensions, but variables like data_station which contains an index to the station table.
 +
 +
To print out a row 5 from data table:
 +
 +
* read the data_station(5). It may contain number 8, which is referring to the station 8 in the station table.
 +
** read station_code(8,all), lat(8) and lon(8)
 +
* read the data_time(5). It may contain number 112, which is referring to the time 112 in the time table.
 +
** read time(112), which contains number 35523, which means 35523 days from 1970-01-01
 +
* read humidity(5) which represents itself.
 +
* read temperature(5)
 +
 +
'''This is a simple and powerful encoding.'''
 +
 +
* '''Any typical schema from SQL-databases can be encoded.'''
 +
* '''If the data is dense, foreign keys can be expressed as dimensions, resulting in vastly smaller files than flat CSV would be.'''
 +
* '''If the data is sparse, foreign keys can be expressed as variables.'''
 +
* '''There can be any combination of the two, enabling the best of both worlds whatever your data will look like.'''

Revision as of 07:29, August 31, 2011

Back to Questions and Comments about CF-1.6 Station Data Convention

Rational Behind this Proposal

The WCS service has potential to grow beyond regular, gridded coverages. Coverage types that need to be described are for example:

  • station point data
  • station trajectory data
  • lidar data
  • moving station data, like aeroplane or ship

A collection of relational tables

Encoding Relational Tables to NetCDF

  • There must be a dimension for the table, to enumerate the records.
  • A column is a variable with that dimension and an attribute table_name
  • A table is made of columns with the same table_name attribute.
  • Primary keys are declared with primary_key = "T" attribute. A table can have multiple keys as a primary key
  • Foreign keys are declared with either using the dimension of another table, or declaring a field to be a reference to another table.

Examples:

Here's how you encode a minimal station table, with all the CF attributes omitted:

   dimensions:
       station = 10 ;  // station table dimension
       code_strlen = 4 ; // artificial dimension for station code, with max four character length.
   variables:
       float lon(station) ; 
           lon:table_name = "station";
       float lat(station) ; 
           lat:table_name = "station";
       char station_code(station, code_strlen) ;
           station_code:table_name = "station";
           station_code:primary_key = "T";

The stations have now index 0-9.

Since we have data with time dimension, let's add the time table, again omitting the CF attributes like units = "days since 1970-01-01 00:00:00"

   dimensions:
       ... as above
       time = UNLIMITED ;
   variables:
       ... as above
       double time(time) ; 
           station_code:table_name = "time";
           station_code:primary_key = "T";

Time can be indexed just as station, 0..(count-1).

So far we have the two unrelated tables for our two dimensions: location and time. Now let's add humidity and temperature, which refers to both of the dimensions:

   dimensions:
       ... as above
       time = UNLIMITED ;
   variables:
       ... as above
       float humidity(time, station) ;
           station_code:table_name = "data";
       float temperature(time, station) ;
           station_code:table_name = "data";

This is identical to CF-1.6 encoding. Both variables are 2-dimensional cubes.

The humidity and temperature are now both in the table named data, which has four fields: time, station, humidity and temperature

This is a very compact encoding if every station has data for every time step, but wasteful if data is sparse, since the cubes will then contain mainly null values. That's why CF drafts have had encodings like ragged arrays.

With multiple tables there's an easy way to encode sparse tables. Instead of above, we'll describe the record dimension:

   dimensions:
       ... as above
       data = 800 ; // 800 records
   variables:
       ... as above
       int data_station(data) ;
           data_station:table_name = "data";
           data_station:foreign_key = "station";
       int data_time(data) ;
           data_station:table_name = "data";
           data_station:foreign_key = "time";
       float humidity(data) ;
           humidity:table_name = "data";
       float temperature(data) ;
           temperature:table_name = "data";

The table still has the four columns, but the encoding is not a cube anymore. The time and station dimensions are not netcdf dimensions, but variables like data_station which contains an index to the station table.

To print out a row 5 from data table:

  • read the data_station(5). It may contain number 8, which is referring to the station 8 in the station table.
    • read station_code(8,all), lat(8) and lon(8)
  • read the data_time(5). It may contain number 112, which is referring to the time 112 in the time table.
    • read time(112), which contains number 35523, which means 35523 days from 1970-01-01
  • read humidity(5) which represents itself.
  • read temperature(5)

This is a simple and powerful encoding.

  • Any typical schema from SQL-databases can be encoded.
  • If the data is dense, foreign keys can be expressed as dimensions, resulting in vastly smaller files than flat CSV would be.
  • If the data is sparse, foreign keys can be expressed as variables.
  • There can be any combination of the two, enabling the best of both worlds whatever your data will look like.