Encoding Relational Tables in NetCDF

Back to Questions and Comments about CF-1.6 Station Data Convention

Rational Behind this Proposal

The WCS service has potential to grow beyond regular, gridded coverages. Coverage types that need to be described are for example:

station point data
station trajectory data
lidar data
moving station data, like aeroplane or ship

With a relational system, it is easy to design a schema that captures all the data and the metadata.

The netCDF data format was designed to contain mainly gridded multidimensional data. The format does not have a special support for tables, which sounds like a severe restriction, but it's not.

It's possible to agree about a simple way to encode tables in a netCDF file.
- Tables with multiple fields.
- Proper primary keys for tables.
- Foreign keys for referential integrity.
- Common relationships: many-to-many, 1-to-many, 1-to-1.

It's possible to be efficient, the encoding does not have to be one size fits all.
- Sparse tables can be encoded row-by-row manner.
- Dense tables can be encoded as multidimensional hypercubes.
- Mix and match the two above at will.

It's possible to be elegant.
- No mysterious indirections.
- No complex offset calculations.
- A reader-writer library can take care of following the rules, letting a programmer to operate on higher level.

Encoding Table Columns

Table is a collection of columns. These are indicated with table_name attribute.

Every table needs a primary key. It is indicated with primary_key = "T" attribute.

Example: A station table with three columns: station_code, lat and lon. There are ten rows with indexes 0..9

   dimensions:
       station = 10 ;  // station table dimension
       station_code_length = 4 ; // artificial dimension for station code, with max four character length.
   variables:
       char station_code(station, station_code_length) ;
           station_code:table_name = "station";
           station_code:primary_key = "T";
       float lon(station) ; 
           lon:table_name = "station";
       float lat(station) ; 
           lat:table_name = "station";

The dimension name station can be anything. It does not have to equal the table name station, but should be for readability. Same for station_code_length, it can be called anything.

A table can have multiple fields as a primary key.

Encoding Multiple Tables

Adding the second table is similar. Since new data is usually added by time, it can be made the unlimited dimension.

   dimensions:
       ... as above
       time = UNLIMITED ;
   variables:
       ... as above
       int time(time) ; 
           station_code:table_name = "time";
           station_code:primary_key = "T";

Encoding Relations

A typical station data instance is a relation (station_code, time, temperature, humidity). The pair (station_code, time) is the primary key; data is recorded at a certain station, at a certain time. Temperature and humidity are the recorded parameters.

Encoding this into netCDF can be done in two ways, using either dimension or a variable.

Encoding Relation as a Multidimensional Array

   variables:
       ... as above
       int temperature(time,station) ; 
           temperature:table_name = "data";
       int humidity(time,station) ; 
           humidity:table_name = "data";

This declares two arrays, one for temperature and one for humidity. Both are two-dimensional. The dimensions time and station refer to the time and station variables. They implicitly became part of the primary key.

   To find temperature at "ACAD" in "2004-17-22":
       Find index for "ACAD" from the "station_code" array. Let's say it's 2.
       Find index for "2004-17-22" from the time array. Lets say it's 317.
       Read temperature(317,2)

This is a very space-efficient encoding, if most of the stations have data most of the time.

Encoding Relation as a Record Array

So far we have the two unrelated tables for our two dimensions: location and time. Now let's add humidity and temperature, which refers to both of the dimensions:

   dimensions:
       ... as above
       time = UNLIMITED ;
   variables:
       ... as above
       float humidity(time, station) ;
           station_code:table_name = "data";
       float temperature(time, station) ;
           station_code:table_name = "data";

This is identical to CF-1.6 encoding. Both variables are 2-dimensional cubes.

The humidity and temperature are now both in the table named data, which has four fields: time, station, humidity and temperature

This is a very compact encoding if every station has data for every time step, but wasteful if data is sparse, since the cubes will then contain mainly null values. That's why CF drafts have had encodings like ragged arrays.

With multiple tables there's an easy way to encode sparse tables. Instead of above, we'll describe the record dimension:

   dimensions:
       ... as above
       data = 800 ; // 800 records
   variables:
       ... as above
       int data_station(data) ;
           data_station:table_name = "data";
           data_station:foreign_key = "station";
       int data_time(data) ;
           data_station:table_name = "data";
           data_station:foreign_key = "time";
       float humidity(data) ;
           humidity:table_name = "data";
       float temperature(data) ;
           temperature:table_name = "data";

The table still has the four columns, but the encoding is not a cube anymore. The time and station dimensions are not netcdf dimensions, but variables like data_station which contains an index to the station table.

To print out a row 5 from data table:

read the data_station(5). It may contain number 8, which is referring to the station 8 in the station table.
- read station_code(8,all), lat(8) and lon(8)
read the data_time(5). It may contain number 112, which is referring to the time 112 in the time table.
- read time(112), which contains number 35523, which means 35523 days from 1970-01-01
read humidity(5) which represents itself.
read temperature(5)

This is a simple and powerful encoding.

Any typical schema from SQL-databases can be encoded.
If the data is dense, foreign keys can be expressed as dimensions, resulting in vastly smaller files than flat CSV would be.
If the data is sparse, foreign keys can be expressed as variables.
There can be any combination of the two, enabling the best of both worlds whatever your data will look like.