Difference between revisions of "Example SQL for Stations"

From Earth Science Information Partners (ESIP)
Line 45: Line 45:
  
  
The view can be created as following:
+
The SQL view definition:
  
 
     CREATE VIEW pm10_data AS
 
     CREATE VIEW pm10_data AS

Revision as of 11:49, July 11, 2011

Back to WCS Access to netCDF Files

A Real life example how to serve station timeseries point data through WCS data access protocol.

AIRNOW

Daily PM 2.5, PM 10 and Ozone data can be downloaded from date encoded folders. The data and location tables are in text files.

Folder for 2011-07-05: http://www.epa.gov/airnow/2011/20110705/

PM 10 observations: http://www.epa.gov/airnow/2011/20110705/pm10_070511.obs

Active sites: http://www.epa.gov/airnow/2011/20110705/active-sites.dat

View in datafed catalog and datafed borser

The Design of the SQL Database

The most common and flexible method of storing station time series point data is an SQL database. The data collector designs the schema, which captures all the measured data and metadata. This very often requires a database with lots of tables, there is no one-size-fits-all design around because of differing requirements.

Fortunately, relational databases were invented just to solve this problem:

Physical Data Tables and Virtual Data Views

The physical database on the left is the real data storage, which contains multiple tables and typical relationships of a fully normalized schema.

On the right is the virtual WCS data access view, a flat, simple data source to select data from. The WCS does not have to understand the physical structure at all. The real configuration of the WCS system is in the making of these SQL views.


SQL views.png

  • Physical database:
    • The center of this database is the location table.
    • Each location has a unique loc_code as the key column.
    • Each measure parameter has its own data table
      • loc_code for the location of the station
      • datetime for the observation time
      • data columns value and quality flag.
  • Virtual View
    • loc_code, lat and lon is selected from the location table
    • datetime, pm10 and pm10 quality flag is selected from the pm10 data table
    • loc_code is used to join location to the data
    • the result is a flat view with six columns


The SQL view definition:

   CREATE VIEW pm10_data AS
   SELECT location.loc_code, lat, lon, datetime, pm10, pm10_qf
   FROM location 
   INNER JOIN pm10 ON location.loc_code = pm10.loc_code

The WCS point data is published using a fairly popular design philosophy called convention over configuration. Rather than allowing any name, longitude or längengrad, and then mark it to be the longitude column somewhere else, the field name is fixed to lon. On the other hand the data and metadata column names can be anything.

Location View

The Location view needs at least 3 columns: loc_code, lat and lon

The standard field names are loc_code, loc_name (optional), lat, lon, elev (optional). It is important to use these names, since client software, like datafed browser, expects them. Any other fields can be added. The system will just copy the additional data to output.

Location table.png

There are two reasons to use an SQL view also for the location table.

  • The location data may be distributed over several tables. In this case, the view can hide the SQL joins.
  • The location table most likely has different names for the columns. In this case, the view may do nothing else but rename SiteCode to loc_code and latitude to lat.

The location table is common for all the measured data parameters. In many cases, all the locations don't contain all the parameters. For example, the PM 10 parameter in AIRNOW contains data in 202 locations, when there are 2516 locations alltogether. For browsing support, it is important to be able to filter the location table in such a manner, that only the locations with data are displayed.

This is achieved by first calculating statistics for each location and parameter. The view can now be expressed as following

   CREATE VIEW locations_for_pm10 AS
   SELECT location.loc_code, loc_name, lat, lon, area_code
   FROM location
   INNER JOIN loc_stats ON loc_stats.loc_code = location.loc_code
   where loc_stats.param_abbr = 'pm10' and loc_stats.data_count > 0

Notes on Physical Fact - Data Table

There are numerous ways to organize the physical data tables. In the AIRNOW database the parameters are pmfine(PM 2.5), pm10 and super(Ozone)

  • Big Wide table: One big table with columns loc_code datetime pmfine pm10 super
    • Fast to query all the three data fields, no joining.
    • Needs NULLs for missing data
  • Long and skinny table: loc_code datetime param_code param_value
    • The param_code is either pmfine, pm10 or super, telling what data the param value is.
    • No need for NULLs
  • Own table for each parameter. The pm10 table contains fields loc_code datetime pm10
    • fastest to query single parameters
    • No need for NULLs
    • More difficult for system administrator.

There is no universally best solution, pros and conses must be weighted for your case.

The main point is Physical - Logical Separation!

The concept of a view is of ultimate importance in database practice. It allows the system administrator to change the real data tables, without breaking existing software, because the views can be maintained separately. For example, you can change the physical table structure from wide table to long and skinny table and still maintain the pm10 view above.

Sometimes, the views have to do so much joining and filtering, that the performance drop to unacceptable level. In this case, it's advisable to create snapshot tables. A straightforward method is to create a location table with integer based loc_id, and the data tables that use this loc_id as a foreign key, then create an efficient data view for the WCS use.

Configuring the WCS Service and WFS Service for point data

WFS, Web Feature Service for locations

The simplest way to configure location table is just point to the view and it's columns.

   view_alias : location
   columns    : loc_code, loc_name, lat, lon

Many locations don't contain any data of some parameters. Therefore, it's very useful to be able to get all the locations that contain some parameter.

In this case, the location view needs to contain parameter configuration contains:

   view_alias  : location_with_statistics
   columns     : loc_code, loc_name, lat, lon
   view_filter : param_abbr='pm10' and data_count > 0

Here is an example of AIRNOW locations filtered by pm10