Example CF-NetCDF for Satellite

Back to WCS Access to netCDF Files

A real-life example of how to download and store satellite data.

Total Ozone Mapping Spectrometer (TOMS) satellite

This data has been collected by three different satellites from 1978 to the present. It is downloadable as text files, each containing a rectangular grid.

Nimbus satellite: grid size 288 * 180

EPTOMS satellite: grid size 288 * 180

There is a gap in the record between 1993-05-06, where the Nimbus data ends, and 1996-07-22, where the EPTOMS data begins.

OMI satellite: grid size 360 * 180 (for example, year 2011, still growing: ftp://toms.gsfc.nasa.gov/pub/omi/data/aerosol/Y2011)

With OMI, the longitude size went from 288 points (1.25-degree steps) to 360 points (1.0-degree steps).

Creating an Empty CF-NetCDF File

Creating NetCDF files programmatically is doable, but harder than it should be. It's much easier to create a high-level text version of the file and use a tool to turn it into a real binary NetCDF file. We use the NetCDF Markup Language, NCML. NCML is similar to the CDL language (http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/CDL-Syntax.html). Since NCML is XML-based, it's more verbose than the domain-specific CDL, but few tools understand CDL, whereas every programming language has an XML package. Therefore NCML is used.

Contents of the Definition File

A NetCDF file following the CF conventions contains the following things:

  • Global Attributes
  • Dimensions
  • Dimension Variables
  • Data Variables

CF conventions are not enforced by NCML, so whoever designs an empty CF-NetCDF file must know the basics of the conventions.

The real AerosolIndex.ncml (http://data1.datafed.net:8080/static/NASA_TOMS/AerosolIndex.ncml) contains the following declarations.
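In outline, everything is wrapped in the NCML root element. The explicit element means that all the metadata is declared right here; the alternative, readMetadata, would require an existing source NetCDF file:

    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
        <explicit />
        <!-- global attributes, dimensions and variables go here -->
    </netcdf>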

Global Attributes

   Conventions = "CF-1.0"
   title = "NASA TOMS Project"
   comment = "NASA Total Ozone Mapping Spectrometer Project"
   keywords = "TimeRes:Day, DataSet:TOMS_AI_G"

The file follows CF conventions version 1.0. The attributes title, comment and keywords are published by the WCS as the coverage title, description and keywords. Any number of other attributes may be present.
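In NCML, each global attribute is a child element of the root:

    <attribute name="Conventions" type="string" value="CF-1.0" />
    <attribute name="title" type="string" value="NASA TOMS Project" />
    <attribute name="comment" type="string" value="NASA Total Ozone Mapping Spectrometer Project" />
    <attribute name="keywords" type="string" value="TimeRes:Day, DataSet:TOMS_AI_G" />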

Dimensions and Dimension Variables

The AerosolIndex is a three-dimensional variable. Two of the dimensions are fixed: Latitude and Longitude.

A NetCDF dimension contains only its length. That's why the CF conventions associate a dimension variable with each dimension, enumerating its coordinate values.

For latitude, both the dimension and the variable are named lat. The data in the variable, [-89.5, -88.5 ... 88.5, 89.5], enumerates the latitude of each index in the dimension. The variable also needs standard attributes:

   standard_name = "latitude"
   long_name = "latitude"
   units = "degrees_north"
   axis = "Y"

The longitude dimension is similar, with standard_name "longitude", units "degrees_east" and axis "X".
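In the NCML, the dimensions are declared first, then the dimension variables. Coordinate values can be written either with a start and an increment, as here, or enumerated as a list; geographic dimensions should be evenly spaced, since the WCS expects that:

    <dimension name="time" length="0" isUnlimited="true" />
    <dimension name="lat" length="180" />
    <dimension name="lon" length="360" />

    <variable name="lat" type="double" shape="lat">
        <attribute name="standard_name" type="string" value="latitude" />
        <attribute name="long_name" type="string" value="latitude" />
        <attribute name="units" type="string" value="degrees_north" />
        <attribute name="axis" type="string" value="Y" />
        <values start="-89.5" increment="1" />
    </variable>

    <variable name="lon" type="double" shape="lon">
        <attribute name="standard_name" type="string" value="longitude" />
        <attribute name="long_name" type="string" value="longitude" />
        <attribute name="units" type="string" value="degrees_east" />
        <attribute name="axis" type="string" value="X" />
        <values start="-179.5" increment="1" />
    </variable>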

The time dimension is a little different: it is unlimited, meaning that new time slices can be appended to the data. The dimension initially has length 0, and the time variable is an integer counting days from the first measurement time:

   standard_name = "time"
   long_name = "time"
   units = "days since 1978-11-01"
   axis = "T"

So for 1978-11-01 the time variable value is "0" and for 1978-11-02 it is "1".
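In the NCML, the time variable is declared as follows:

    <variable name="time" type="int" shape="time">
        <attribute name="standard_name" type="string" value="time" />
        <attribute name="long_name" type="string" value="time" />
        <attribute name="units" type="string" value="days since 1978-11-01" />
        <attribute name="axis" type="string" value="T" />
    </variable>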

It is advisable to use an integer data type. If you have hourly data, don't use "days since 2000-01-01" with a float data type: days = hours/24 does not have an exact decimal representation, so you'll get rounding errors. Use an integer count of hours instead.

Data Variables

The data variable is of type float, with dimensions (time, lat, lon). Time must be the first dimension, since it is unlimited, and the order (time, lat, lon) is the one recommended by the CF conventions.

   long_name = "Aerosol Index"
   units = "fraction"
   _FillValue = NaN
   missing_value = NaN

Missing data is most naturally represented by Not a Number, NaN; the _FillValue and missing_value attributes should be the same. Using numeric values like -999 is possible, but carries the danger of accidentally corrupting calculations such as averages.
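The NCML declaration of the data variable:

    <variable name="AI" type="float" shape="time lat lon">
        <attribute name="long_name" type="string" value="Aerosol Index" />
        <attribute name="units" type="string" value="fraction" />
        <attribute name="_FillValue" type="float" value="NaN" />
        <attribute name="missing_value" type="float" value="NaN" />
    </variable>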

Creating NetCDF from NCML

Running the Python program AI_create.py (http://data1.datafed.net:8080/static/NASA_TOMS/AI_create.py) produces the empty NetCDF cube. Any other NCML tool could be used.
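The script is essentially two lines; the '64bitoffset' option creates a cube that can grow beyond 2 GB:

    from datafed import cf1
    cf1.create_ncml22('AerosolIndex.nc', 'AerosolIndex.ncml', '64bitoffset')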

Downloading the Data Files

It's possible to download a data file and directly append it into the NetCDF cube without storing any temporary files. This approach has the drawback that if anything goes wrong, you have to download everything again; if you later want to redo some data processing, repeating the whole process is inconvenient due to long download times. With current disk space prices, it's better to download the files first and store them locally. Since these are just text files, you could download them with your browser, which works fine when the dataset is updated at most a few times a year. If the data is updated daily, it's much nicer to have a script you can call at will to download a file.

The module AI_ftp.py (http://data1.datafed.net:8080/static/NASA_TOMS/AI_ftp.py) does just that: it stores the files locally and retrieves only new ones. It can also be used as a library to download a single file at will.
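AI_ftp.py contains templates for the FTP data URLs. The %Y, %m and %d are Python date format codes (4-digit year, 2-digit month, 2-digit day), so formatting a datetime with the template yields the path of a data file:

    template_path_omi = '/pub/omi/data/aerosol/Y%Y/L3_aersl_omi_%Y%m%d.txt'
    first_omi_datetime = datetime.datetime(2004, 9, 6)

    # AI_ftp.determine_ftp_path(datetime.datetime(2010, 3, 24)) returns
    # '/pub/omi/data/aerosol/Y2010/L3_aersl_omi_20100324.txt'

Invoked from the command line with two dates, it downloads the files in that time range; with a single date, it downloads everything from that date up to today:

    AI_ftp 1978-11-01 2011-07-07

Used as a library, it allows downloading one file at will:

    import AI_ftp, datetime
    ftp_conn = AI_ftp.FtpRetriever()
    AI_ftp.download_one_datetime(ftp_conn, datetime.datetime(2010, 2, 28))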

Compile Data Files into CF-NetCDF

Compile the Text File into a Rectangular Array

This is the part that requires most of the programming in the whole system. Since the data is in text files, the script needs to read the text, parse the numbers and assign them into the time slice array. No single text file reading routine can read every format, so this code cannot be applied directly to anything but these Aerosol Index files; you have to write your own reader for your data.

In many cases, if the data already comes in daily NetCDF files, this parsing can be omitted, since reading arrays from NetCDF files is trivial.

The data parser AI_data.py (http://data1.datafed.net:8080/static/NASA_TOMS/AI_data.py) can serve as sample code.
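For illustration only, a hypothetical reader in the same spirit; the real format details live in AI_data.py, and the header handling and value layout assumed here do not match the actual TOMS files:

    import numpy as np

    def read_time_slice(path, n_lat=180, n_lon=360):
        # Hypothetical reader: skip header/label lines, collect the
        # whitespace-separated numbers, reshape into a (lat, lon) grid.
        values = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line[0].isalpha():   # assumed header marker
                    continue
                values.extend(float(tok) for tok in line.split())
        if len(values) != n_lat * n_lon:
            raise ValueError('expected %d values, got %d'
                             % (n_lat * n_lon, len(values)))
        return np.asarray(values, dtype=np.float32).reshape(n_lat, n_lon)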

Append the time slice

This is again the easy part: since everything is now standardized, the library interface can be simple to use. AI_update.py (http://data1.datafed.net:8080/static/NASA_TOMS/AI_update.py) is short and straightforward.
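AI_update.py itself uses the datafed library; purely as a sketch of the same idea, appending a slice with the generic netCDF4-python package (an assumption, not what AI_update.py actually imports) looks roughly like this, where the date arithmetic relies on the "days since 1978-11-01" units declared above:

    import datetime
    import netCDF4

    def append_time_slice(nc_path, day, grid):
        # Index along the unlimited time dimension: integer days
        # since the epoch declared in the time variable's units.
        epoch = datetime.datetime(1978, 11, 1)
        index = (day - epoch).days
        nc = netCDF4.Dataset(nc_path, 'a')
        nc.variables['time'][index] = index      # grows the record dimension
        nc.variables['AI'][index, :, :] = grid   # (lat, lon) float array
        nc.close()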

Client-side browser view of NASA TOMS WCS

A map query for a large area and a single time instance: Identifier=AerosolIndex, RangeSubset=AI, TimeSequence=2011-01-01, BoundingBox=-179.5,-89.5,179.5,89.5. The actual WCS GetCoverage call is:

http://data1.datafed.net:8080/NASA_TOMS?Service=WCS&Version=1.1.2&Request=GetCoverage&Identifier=AerosolIndex&Format=image/netcdf&Store=true&TimeSequence=2011-01-01&RangeSubset=AI&BoundingBox=-179.5,-89.5,179.5,89.5,urn:ogc:def:crs:OGC:2:84


A time series query for a time range and a single location: Identifier=AerosolIndex, RangeSubset=AI, TimeSequence=2010-01-01/2011-07-05/P1D, BoundingBox=-10.5,4.5,-10.5,4.5. The actual WCS GetCoverage call is:

http://data1.datafed.net:8080/NASA_TOMS?Service=WCS&Version=1.1.2&Request=GetCoverage&Identifier=AerosolIndex&Format=image/netcdf&Store=true&TimeSequence=2010-01-01/2011-07-05&RangeSubset=AI&BoundingBox=-10.5,4.5,-10.5,4.5,urn:ogc:def:crs:OGC:2:84

A TimeSequence like 2005-06-01/2011-09-01/PT1H has the form time_min/time_max/periodicity, where the periodicity is an ISO 8601 duration (http://en.wikipedia.org/wiki/ISO_8601#Durations): PT1H is hourly, P1D is daily.
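A small sketch of assembling such a GetCoverage URL in Python; the parameter names and values come straight from the calls above, only the assembly itself is illustrative:

    base = 'http://data1.datafed.net:8080/NASA_TOMS'
    params = [
        ('Service', 'WCS'),
        ('Version', '1.1.2'),
        ('Request', 'GetCoverage'),
        ('Identifier', 'AerosolIndex'),
        ('Format', 'image/netcdf'),
        ('Store', 'true'),
        ('TimeSequence', '2010-01-01/2011-07-05'),
        ('RangeSubset', 'AI'),
        ('BoundingBox', '-10.5,4.5,-10.5,4.5,urn:ogc:def:crs:OGC:2:84'),
    ]
    url = base + '?' + '&'.join('%s=%s' % kv for kv in params)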

Final Product

Core Catalog: http://webapps.datafed.net/Core.uFIND?dataset=toms_ai_g

Browse Data: http://webapps.datafed.net/datafed.aspx?wcs=http://data1.datafed.net:8080/NASA_TOMS&coverage=AerosolIndex

Service Page: http://data1.datafed.net:8080/NASA_TOMS