Example SQL for Stations

A Real life example how to serve station timeseries point data through WCS data access protocol.

AIRNOW

Daily PM 2.5, PM 10 and Ozone data can be downloaded from date encoded folders. The data and location tables are in text files.

Folder for 2011-07-05: http://www.epa.gov/airnow/2011/20110705/

PM 10 observations: http://www.epa.gov/airnow/2011/20110705/pm10_070511.obs

Active sites: http://www.epa.gov/airnow/2011/20110705/active-sites.dat

View in datafed catalog and datafed borser

The Design of the SQL Database

The most common and flexible method of storing station time series point data is an SQL database. The data collector designs the schema, which captures all the measured data and metadata. This very often requires a database with lots of tables, there is no one-size-fits-all design around because of differing requirements.

Fortunately, relational databases were invented just to solve this problem:

Physical Data Tables and Virtual Data Views

The physical database on the left is the real data storage, which contains multiple tables and typical relationships of a fully normalized schema.

On the right is the virtual WCS data access view, a flat, simple data source to select data from. The WCS does not have to understand the physical structure at all. The real configuration of the WCS system is in the making of these SQL views.

Physical database:
- The center of this database is the location table.
- Each location has a unique loc_code as the key column.
- Each measure parameter has its own data table
  - loc_code for the location of the station
  - datetime for the observation time
  - data columns value and quality flag.

Virtual View
- loc_code, lat and lon is selected from the location table
- datetime, pm10 and pm10 quality flag is selected from the pm10 data table
- loc_code is used to join location to the data
- the result is a flat view with six columns

The SQL view definition:

   CREATE VIEW pm10_data AS
   SELECT location.loc_code, lat, lon, datetime, pm10, pm10_qf
   FROM location 
   INNER JOIN pm10 ON location.loc_code = pm10.loc_code

The WCS point data is published using a fairly popular design philosophy called convention over configuration. Rather than allowing any name, longitude or längengrad, and then mark it to be the longitude column somewhere else, the field name is fixed to lon. On the other hand the data and metadata column names can be anything.

Location View

The Location view needs at least 3 columns: loc_code, lat and lon

The standard field names are loc_code, loc_name (optional), lat, lon, elev (optional). It is important to use these names, since client software, like datafed browser, expects them. Any other fields can be added. The system will just copy the additional data to output.

There are two reasons to use an SQL view also for the location table.

The location data may be distributed over several tables. In this case, the view can hide the SQL joins.
The location table most likely has different names for the columns. In this case, the view may do nothing else but rename SiteCode to loc_code and latitude to lat.

The location table is common for all the measured data parameters. In many cases, all the locations don't contain all the parameters. For example, the PM 10 parameter in AIRNOW contains data in 202 locations, when there are 2516 locations alltogether. For browsing support, it is important to be able to filter the location table in such a manner, that only the locations with data are displayed.

This is achieved by first calculating statistics for each location and parameter. The SQL view definition:

   CREATE VIEW location_with_statistics AS
   SELECT location.loc_code, loc_name, lat, lon, area_code, param_abbr, data_count
   FROM location
   INNER JOIN loc_stats ON loc_stats.loc_code = location.loc_code

This results as a table which lists all the locations, and data counts for all the parameters.

Notes on the Physical Database Schema Possibilities

There are numerous ways to organize the physical data tables, and all of these ways have pros and cons. The AIRNOW database is organized by creating a physical data table for each parameter separately. This arrangement has a some advantages: fastest to query single parameters and no need for NULLs

Other possibilies are:

Big Wide table: One big table with columns loc_code datetime pmfine pm10 super. In this case, querying multiple parameters at the same time requires no joins. Unfortunately, missing data must be expressed with NULLs.

Long and skinny table: loc_code datetime param_code param_value. In this case each row contains two data columns: the parameter code, telling what the measurement actually is, and the parameter value. In this case new parameter codes can be added any time without changing the database schema.

There is no universally best solution, pros and conses must be weighted for your case.

The main point is Physical - Logical Separation! Whatever is your physical schema, the flat views are used for data access. The view system allows you to change the physical schema completely without changing the WCS configuration.

Configuring the WCS Service and WFS Service for point data

In the CF-NetCDF, the CF metadata is enough to do all the configuration. The configuration contains:

view names for location and data parameters
fields for each of the views

With SQL views in place, you still need to tell the WCS server the view names. In the AIRNOW case

   view_alias  : location_with_statistics
   columns     : loc_code, loc_name, lat, lon
   view_filter : param_abbr='pm10' and data_count > 0

Here is an example of AIRNOW locations filtered by pm10.