Example CF-NetCDF for Satellite

From Earth Science Information Partners (ESIP)


A real-life example of how to download and store satellite data.

Total Ozone Mapping Spectrometer (TOMS) satellite

This data has been collected by three different satellites from 1978 to the present. It is downloadable as text files containing a rectangular latitude-longitude grid.

Nimbus satellite:

Grid size 288 * 180

EPTOMS satellite:

Grid size 288 * 180

Note that there is a gap in the data between 1993-05-06 and 1996-07-22.

OMI satellite:

Grid size 360 * 180

The python module AI_ftp.py contains the templates for ftp data urls, like

   template_path_omi = '/pub/omi/data/aerosol/Y%Y/L3_aersl_omi_%Y%m%d.txt'
   first_omi_datetime = datetime.datetime(2004, 9, 6)

The %Y, %m and %d are Python strftime format codes: 4-digit year, 2-digit month and 2-digit day. By formatting a datetime with the template, you get the data URL.
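For example, the template can be expanded with the standard-library strftime:

```python
import datetime

# The template from AI_ftp.py shown above; strftime expands the
# %Y/%m/%d codes into the date's year, month and day.
template_path_omi = '/pub/omi/data/aerosol/Y%Y/L3_aersl_omi_%Y%m%d.txt'

path = datetime.datetime(2010, 3, 24).strftime(template_path_omi)
# path == '/pub/omi/data/aerosol/Y2010/L3_aersl_omi_20100324.txt'
```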

Creating an Empty CF-NetCDF File

Creating NetCDF files programmatically is doable, but harder than it should be. It's much easier to create a high level, text version of the file and use a tool to turn it into a real binary NetCDF file.

We use the NetCDF Markup Language, NCML, to create the CF-NetCDF file AerosolIndex.nc.

The given AerosolIndex.ncml describes the three dimensions (lat, lon and time), the dimension variables, and the data variable AI(time,lat,lon). Once the NCML file is done, creating a NetCDF file is one line of code.

The NCML language is similar to the CDL language. Since NCML is XML-based, it is more verbose than the domain-specific CDL, but very few tools understand CDL, whereas every programming language has an XML package. Therefore we use NCML.

NetCDF Markup Language (NCML)

NCML documentation

The first line is the root element and namespace

   <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">

The explicit element means that all the metadata is declared here. The alternative, readMetadata, would require an existing source NetCDF file.

       <explicit />

Global attributes. Note that NCML does not require CF conventions, so you have to declare the convention yourself.

       <attribute name="title" type="string" value="NASA TOMS Project" />
       <attribute name="comment" type="string" value="NASA Total Ozone Mapping Spectrometer Project" />
       <attribute name="Conventions" type="string" value="CF-1.0" />

Declare the dimensions. This is a 3-dimensional grid, with time as the unlimited dimension. Since there are two grid sizes, 288 with 1.25-degree steps and 360 with 1.0-degree steps, we choose 360 steps for the longitude and adjust the Nimbus and EPTOMS data.

       <dimension name="time" length="0" isUnlimited="true" />
       <dimension name="lat" length="180" />
       <dimension name="lon" length="360" />
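The article does not show how the 288-column data is adjusted onto the 360-column grid. A minimal sketch, assuming a nearest-neighbor mapping between cell centers (the actual adjustment used may differ):

```python
import numpy

# Hypothetical nearest-neighbor mapping from a 288-column (1.25 degree)
# source row onto the 360-column (1.0 degree) target row.
# Cell-center longitudes for both grids:
src_lons = -180.0 + 1.25 * (numpy.arange(288) + 0.5)
dst_lons = -180.0 + 1.0 * (numpy.arange(360) + 0.5)

# For each target cell, the index of the nearest source cell
nearest = numpy.abs(dst_lons[:, None] - src_lons[None, :]).argmin(axis=1)

src_row = numpy.arange(288, dtype='float32')  # stand-in for one data row
dst_row = src_row[nearest]                    # 360 values on the target grid
```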

Time dimension. It is advisable to use an integer data type. If you have hourly data, don't use "days since 2000-01-01" with a float data type, since days = hours/24 does not have an exact decimal representation and you will get rounding errors.

       <variable name="time" type="int" shape="time">
           <attribute name="standard_name" type="string" value="time" />
           <attribute name="long_name" type="string" value="time" />
           <attribute name="units" type="string" value="days since 1978-11-01" />
           <attribute name="axis" type="string" value="T" />
       </variable>
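With integer days and the epoch from the units attribute above, computing a record's time coordinate is exact integer arithmetic (standard-library sketch):

```python
import datetime

# Epoch matching the units attribute above: "days since 1978-11-01"
epoch = datetime.datetime(1978, 11, 1)

def time_coordinate(dt):
    """Whole days between the epoch and dt: exact integer arithmetic."""
    return (dt - epoch).days

day_index = time_coordinate(datetime.datetime(2010, 3, 24))
# round-trips exactly: epoch + day_index days recovers the original date
```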

Geographical dimensions. The naming and attributes are important. Dimension values can be written either with a start and an increment, or enumerated as a list of values. Geographical dimensions should be evenly spaced (WCS expects that), but elevation and similar dimensions can be enumerated.

       <variable name="lat" type="double" shape="lat">
           <attribute name="standard_name" type="string" value="latitude" />
           <attribute name="long_name" type="string" value="latitude" />
           <attribute name="units" type="string" value="degrees_north" />
           <attribute name="axis" type="string" value="Y" />
           <values start="-89.5" increment="1" />
       </variable>
       <variable name="lon" type="double" shape="lon">
           <attribute name="standard_name" type="string" value="longitude" />
           <attribute name="long_name" type="string" value="longitude" />
           <attribute name="units" type="string" value="degrees_east" />
           <attribute name="axis" type="string" value="X" />
           <values start="-179.5" increment="1" />
       </variable>
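The start/increment declarations above expand to evenly spaced coordinate arrays; a quick NumPy check of the endpoints:

```python
import numpy

# Coordinate values produced by the <values start increment> elements above
lats = -89.5 + 1.0 * numpy.arange(180)    # -89.5 ... 89.5
lons = -179.5 + 1.0 * numpy.arange(360)   # -179.5 ... 179.5
```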

The data variable. The _FillValue and missing_value attributes should be the same, and NaN is the recommended missing value. Numeric sentinels like -999 are dangerous, since they may accidentally mess up averages and other statistics.

       <variable name="AI" type="float" shape="time lat lon">
           <attribute name="long_name" type="string" value="Aerosol Index" />
           <attribute name="units" type="string" value="fraction" />
           <attribute name="_FillValue" type="float" value="NaN" />
           <attribute name="missing_value" type="float" value="NaN" />
       </variable>
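A quick demonstration of why a -999 sentinel is dangerous while NaN is safe: NumPy's nan-aware functions skip NaN, but a numeric sentinel silently enters the arithmetic.

```python
import numpy

# A -999 sentinel poisons plain statistics:
sentinel = numpy.array([0.5, -999.0, 0.7], dtype='float32')
print(sentinel.mean())          # wildly skewed by the sentinel value

# NaN is ignored by the nan-aware functions:
nan_vals = numpy.array([0.5, numpy.nan, 0.7], dtype='float32')
print(numpy.nanmean(nan_vals))  # close to 0.6, the mean of the valid data
```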

Closing tag.

   </netcdf>

Create Script AI_create.py

Running the following Python program, AI_create.py, creates the empty NetCDF cube.

   from datafed import cf1
   cf1.create_ncml22('AerosolIndex.nc', 'AerosolIndex.ncml', '64bitoffset')

Any other NCML tool can be used. The '64bitoffset' option creates a cube that can grow beyond 2 GB.

Downloading the Data Files

It is possible to download a data file and append it directly into the NetCDF cube, without storing any temporary files. This approach has a drawback: if anything goes wrong, you have to download everything again. The problem need not be an error; perhaps you want to do some extra data processing, and redoing the whole download is inconvenient. With today's disk capacities, it is better to download the files first and store them locally.

The module AI_ftp.py does just that. It stores the file locally, and retrieves only new files.

Make URL from the Template

The programmer can now get the URL; the following returns '/pub/omi/data/aerosol/Y2010/L3_aersl_omi_20100324.txt':

   AI_ftp.determine_ftp_path(datetime.datetime(2010, 3, 24))

Download the Text File

Since these are just text files, you could simply download them with your browser. That works fine when the dataset is updated at most four times a year; if the data is updated daily, it is much nicer to have a script you can call at will to download a file.

The module AI_ftp can be used both ways. If invoked from the command line:

  C:\OWS\web\static\NASA_TOMS>AI_ftp 1978-11-01 2011-07-07

it downloads the files in that date range. With only one date given, it downloads everything from that date up to today.
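The range behaviour can be pictured as a simple day-by-day loop (a sketch of the assumed behaviour, not the actual AI_ftp internals):

```python
import datetime

def daterange(start, end):
    """Yield every date from start to end, inclusive, one day at a time."""
    day = start
    while day <= end:
        yield day
        day += datetime.timedelta(days=1)

days = list(daterange(datetime.datetime(2011, 7, 5),
                      datetime.datetime(2011, 7, 7)))
# each date would then be passed to the per-file download routine
```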

Used as a library, it allows downloading a file at will:

   import AI_ftp, datetime
   ftp_conn = AI_ftp.FtpRetriever()
   AI_ftp.download_one_datetime(ftp_conn, datetime.datetime(2010, 2, 28))

Compile Data Files into CF-NetCDF

Compile the Text File into a Rectangular Array

This is the part that requires most of the programming in the whole system. Since the data is in text files, the script needs to read the text, parse the numbers, and assign them into the time-slice array.

This is also very format-specific. You cannot write one text-file reading routine that can read any format, so this code cannot be directly applied to anything but these AerosolIndex files.

The daily slice compilation is done in three steps:

   def parse_grid(filename):
       lines = open(filename).readlines()
       lon_count, lat_count = _get_grid_dimensions(lines)
       lat_lines = _group_by_latitude(lines[3:])
       return _parse_body(lat_lines, lat_count, lon_count)

After reading all the lines from the text file, the size of the grid is determined. The Nimbus and EPTOMS satellite data is exported with a longitude length of 288 (1.25-degree steps); OMI data has 360 steps (1-degree step). So this information cannot be hardcoded in the file reader.

The lines in the file are three header lines, followed by the data lines, which are grouped by latitude.


Numeric Python, NumPy, is a Python library for multidimensional arrays. Read the Tutorial for more information.


The daily slice itself has three dimensions: 1 for time, 180 for latitude and 360 for longitude.

   import numpy
   nc_lat_count = 180
   nc_lon_count = 360
   # create the array, filled with NaNs
   grid = numpy.full((1, nc_lat_count, nc_lon_count), numpy.nan, dtype='float32')

The download procedure


Append the time slice