Difference between revisions of "Encoding Relational Tables in NetCDF"
Line 135: | Line 135: | ||
'''int data_station_code(data) ;''' | '''int data_station_code(data) ;''' | ||
'''data_station_code:table_name = "data";''' | '''data_station_code:table_name = "data";''' | ||
+ | '''data_station_code:foreign_key = "station_code";''' | ||
'''data_station_code:primary_key = "T";''' | '''data_station_code:primary_key = "T";''' | ||
− | |||
'''int data_time(data) ;''' | '''int data_time(data) ;''' | ||
'''data_time:table_name = "data";''' | '''data_time:table_name = "data";''' | ||
+ | '''data_time:foreign_key = "time";''' | ||
'''data_time:primary_key = "T";''' | '''data_time:primary_key = "T";''' | ||
− | |||
'''float temperature(data) ; ''' | '''float temperature(data) ; ''' | ||
'''temperature:table_name = "data";''' | '''temperature:table_name = "data";''' | ||
Line 146: | Line 146: | ||
'''humidity:table_name = "data";''' | '''humidity:table_name = "data";''' | ||
− | The | + | The variable data_station_code is marked ''foreign_key = "station_code"'', which means, that the variable is an index to the station_code variable. Similar for time. Both are marked ''primary_key = "T"'' which means, that the combination of (data_station_code, data_time) is unique. |
− | |||
− | |||
− | |||
− | |||
+ | To read the full table (station_code, lat, lon, time, temperature, humidity), row by row: | ||
+ | for idx in range of data | ||
+ | read station: | ||
+ | read data_station_code(idx) | ||
+ | read variables station_code, lat and lon at that index | ||
+ | read time: | ||
+ | read data_time(idx) | ||
+ | read variable time at that index | ||
+ | read temperature(idx) | ||
+ | read humidity(idx) | ||
− | + | This is an space-efficient encoding for sparse tables. | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | == Summa Summarum == | |
* '''Any typical schema from SQL-databases can be encoded.''' | * '''Any typical schema from SQL-databases can be encoded.''' |
Revision as of 10:11, August 31, 2011
Back to Questions and Comments about CF-1.6 Station Data Convention
Rational Behind this Proposal
The WCS service has potential to grow beyond regular, gridded coverages. Coverage types that need to be described are for example:
- station point data
- station trajectory data
- lidar data
- moving station data, like aeroplane or ship
With a relational system, it is easy to design a schema that captures all the data and the metadata.
The netCDF data format was designed to contain mainly gridded multidimensional data. The format does not have a special support for tables, which sounds like a severe restriction, but it's not.
- It's possible to agree about a simple way to encode tables in a netCDF file.
- Tables with multiple fields.
- Proper primary keys for tables.
- Foreign keys for referential integrity.
- Common relationships: many-to-many, 1-to-many, 1-to-1.
- It's possible to be efficient, the encoding does not have to be one size fits all.
- Sparse tables can be encoded row-by-row manner.
- Dense tables can be encoded as multidimensional hypercubes.
- Mix and match the two above at will.
- It's possible to be elegant.
- No mysterious indirections.
- No complex offset calculations.
- A reader-writer library can take care of following the rules, letting a programmer to operate on higher level.
Encoding Table Columns
Table is a collection of columns. These are indicated with table_name attribute.
Every table needs a primary key. It is indicated with primary_key = "T" attribute.
Example: A station table with three columns: station_code, lat and lon. There are ten rows with indexes 0..9
dimensions: station = 10 ; // station table dimension station_code_length = 4 ; // artificial dimension for station code, with max four character length. variables: char station_code(station, station_code_length) ; station_code:table_name = "station"; station_code:primary_key = "T"; float lon(station) ; lon:table_name = "station"; float lat(station) ; lat:table_name = "station";
The dimension name station can be anything. It does not have to equal the table name station, but should be for readability. Same for station_code_length, it can be called anything.
A table can have multiple fields as a primary key.
Encoding Multiple Tables
Adding the second table is similar. Since new data is usually added by time, it can be made the unlimited dimension.
dimensions: station = 10 ; // station table dimension station_code_length = 4 ; // artificial dimension for station code, with max four character length. time = UNLIMITED ; variables: char station_code(station, station_code_length) ; station_code:table_name = "station"; station_code:primary_key = "T"; float lon(station) ; lon:table_name = "station"; float lat(station) ; lat:table_name = "station"; int time(time) ; time:table_name = "time"; time:primary_key = "T";
Encoding Relations
A typical station data instance is a relation (station_code, time, temperature, humidity). The pair (station_code, time) is the primary key; data is recorded at a certain station, at a certain time. Temperature and humidity are the recorded parameters.
Encoding this into netCDF can be done in two ways, using either dimension or a variable.
Encoding Relation as a Multidimensional Array
This is identical to CF-1.6 encoding. Both variables are 2-dimensional cubes.
dimensions: station = 10 ; // station table dimension station_code_length = 4 ; // artificial dimension for station code, with max four character length. time = UNLIMITED ; variables: char station_code(station, station_code_length) ; station_code:table_name = "station"; station_code:primary_key = "T"; float lon(station) ; lon:table_name = "station"; float lat(station) ; lat:table_name = "station"; int time(time) ; time:table_name = "time"; time:primary_key = "T"; float temperature(time,station) ; temperature:table_name = "data"; float humidity(time,station) ; humidity:table_name = "data";
This declares two arrays, one for temperature and one for humidity. Both are two-dimensional. The dimensions time and station refer to the time and station variables. They implicitly became part of the primary key.
To find temperature at "ACAD" in "2004-17-22": Find index for "ACAD" from the "station_code" array. Let's say it's 2. Find index for "2004-17-22" from the time array. Lets say it's 317. Read temperature(317,2)
This is a very space-efficient encoding, if most of the stations have data most of the time.
Encoding Relation as a Record Array
So far we have the two unrelated tables for our two dimensions: location and time. Now let's add humidity and temperature, which refers to both of the dimensions:
dimensions: station = 10 ; // station table dimension station_code_length = 4 ; // artificial dimension for station code, with max four character length. data = UNLIMITED // each row in the data table. time = 2000 ; we can't have two unlimited dimensions. variables: char station_code(station, station_code_length) ; station_code:table_name = "station"; station_code:primary_key = "T"; float lon(station) ; lon:table_name = "station"; float lat(station) ; lat:table_name = "station"; int time(time) ; time:table_name = "time"; time:primary_key = "T"; int data_station_code(data) ; data_station_code:table_name = "data"; data_station_code:foreign_key = "station_code"; data_station_code:primary_key = "T"; int data_time(data) ; data_time:table_name = "data"; data_time:foreign_key = "time"; data_time:primary_key = "T"; float temperature(data) ; temperature:table_name = "data"; float humidity(data) ; humidity:table_name = "data";
The variable data_station_code is marked foreign_key = "station_code", which means, that the variable is an index to the station_code variable. Similar for time. Both are marked primary_key = "T" which means, that the combination of (data_station_code, data_time) is unique.
To read the full table (station_code, lat, lon, time, temperature, humidity), row by row: for idx in range of data read station: read data_station_code(idx) read variables station_code, lat and lon at that index read time: read data_time(idx) read variable time at that index read temperature(idx) read humidity(idx)
This is an space-efficient encoding for sparse tables.
Summa Summarum
- Any typical schema from SQL-databases can be encoded.
- If the data is dense, foreign keys can be expressed as dimensions, resulting in vastly smaller files than flat CSV would be.
- If the data is sparse, foreign keys can be expressed as variables.
- There can be any combination of the two, enabling the best of both worlds whatever your data will look like.