Example CF-NetCDF for Satellite

A real-life example of how to download and store satellite data.

Total Ozone Mapping Spectrometer (TOMS) satellite

This data has been collected by three different satellites from 1978 to the present. It is downloadable as text files laid out on a regular rectangular grid.

  • Nimbus satellite: grid size 288 * 180
  • EPTOMS satellite: grid size 288 * 180 (note the gap in the data between 1993-05-06 and 1996-07-22)
  • OMI satellite: grid size 360 * 180

Creating an Empty CF-NetCDF File

Creating NetCDF files programmatically is doable, but harder than it should be. It's much easier to create a high-level text version of the file and use a tool to turn it into a real binary NetCDF file. We use the NetCDF Markup Language, NCML.

The NCML language is similar to the CDL language. Since NCML is XML-based, it's more verbose than the domain-specific CDL. But there are few tools that understand CDL, whereas every programming language has an XML package. Therefore NCML is used.
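
For example, a single global attribute looks like this in each notation (a minimal illustration):

   // CDL
   :title = "NASA TOMS Project" ;

   <!-- NCML -->
   <attribute name="title" value="NASA TOMS Project" />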

NetCDF Markup Language (NCML)

NCML documentation

A NetCDF file following the CF conventions contains the following things:

  • Global Attributes
  • Dimensions
  • Dimension Variables
  • Data Variables

CF conventions are not enforced by NCML. A person who designs an empty CF-NetCDF file must know the basics of the conventions.

The real AerosolIndex.ncml contains the following declarations.

Global Attributes

   Conventions = "CF-1.0"
   title = "NASA TOMS Project"
   comment = "NASA Total Ozone Mapping Spectrometer Project"
   keywords = "TimeRes:Day, DataSet:TOMS_AI_G"

This declares that the file follows the CF conventions. The attributes title, comment, and keywords are published by the WCS as the coverage title, description, and keywords. Other attributes can exist without limit.

Dimensions and Dimension Variables

The AerosolIndex is a three-dimensional variable. Two of the dimensions are fixed: Latitude and Longitude.

A NetCDF dimension contains only the dimension length. That's why the CF conventions associate a dimension variable with each dimension.

For latitude, both the dimension and the variable are named lat. The data in the variable [-89.5, -88.5 ... 88.5, 89.5] enumerates the latitude of each index in the dimension. The variable also needs standard attributes:

  • standard_name = "latitude"
  • long_name = "latitude"
  • units = "degrees_north"
  • axis = "Y"

The longitude dimension is handled similarly.

The time dimension is a little different. It is unlimited, meaning that new time slices can be added to the data. The time dimension initially has length 0, and the time variable is an integer counting days from the first data measurement time:

  • standard_name = "time" ;
  • long_name = "time" ;
  • units = "days since 1978-11-01" ;
  • axis = "T" ;

So for 1978-11-01 the time variable value is "0" and for 1978-11-02 it is "1".

It is advisable to use an integer data type. If you have hourly data, don't use "days since 2000-01-01" with a float data type, since days = hours/24 does not have an exact decimal representation and you'll get rounding errors.
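
As a quick illustration of the integer time coordinate (a minimal sketch; the epoch matches the units attribute above, the function name is ours):

   import datetime

   EPOCH = datetime.datetime(1978, 11, 1)       # matches "days since 1978-11-01"

   def time_index(dt):
       # whole days since the epoch; exact integers, so no rounding errors
       return (dt - EPOCH).days

   time_index(datetime.datetime(1978, 11, 2))   # -> 1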

Data Variables

The data variable is of type float, with dimensions (time, lat, lon). Time must be the first dimension, since it is unlimited. The order lat, lon is recommended by the CF conventions. The variable carries the following attributes:

  • long_name = "Aerosol Index" ;
  • units = "fraction" ;
  • _FillValue = NaNf ;
  • missing_value = NaNf ;
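
Putting the declarations above together, the whole NCML file looks roughly like the following sketch. The lon length of 360 assumes a 1-degree grid, and the real AerosolIndex.ncml may differ in detail:

   <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
     <attribute name="Conventions" value="CF-1.0" />
     <attribute name="title" value="NASA TOMS Project" />
     <attribute name="comment" value="NASA Total Ozone Mapping Spectrometer Project" />
     <attribute name="keywords" value="TimeRes:Day, DataSet:TOMS_AI_G" />

     <dimension name="time" length="0" isUnlimited="true" />
     <dimension name="lat" length="180" />
     <dimension name="lon" length="360" />

     <variable name="time" shape="time" type="int">
       <attribute name="standard_name" value="time" />
       <attribute name="units" value="days since 1978-11-01" />
       <attribute name="axis" value="T" />
     </variable>

     <variable name="lat" shape="lat" type="float">
       <attribute name="standard_name" value="latitude" />
       <attribute name="units" value="degrees_north" />
       <attribute name="axis" value="Y" />
       <values start="-89.5" increment="1.0" />
     </variable>

     <variable name="lon" shape="lon" type="float">
       <attribute name="standard_name" value="longitude" />
       <attribute name="units" value="degrees_east" />
       <attribute name="axis" value="X" />
       <values start="-179.5" increment="1.0" />
     </variable>

     <variable name="AI" shape="time lat lon" type="float">
       <attribute name="long_name" value="Aerosol Index" />
       <attribute name="units" value="fraction" />
       <attribute name="_FillValue" type="float" value="NaN" />
       <attribute name="missing_value" type="float" value="NaN" />
     </variable>
   </netcdf>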

Creating NetCDF from NCML

Running the following Python program, AI_create.py, creates the empty NetCDF cube:

   from datafed import cf1
   cf1.create_ncml22('AerosolIndex.nc', 'AerosolIndex.ncml', '64bitoffset')

Any other NCML tool can be used. The 64bitoffset option creates a cube that can grow beyond 2 GB.
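
If you'd rather create the cube programmatically after all, here is a minimal sketch with the netCDF4 Python library (NETCDF3_64BIT is that library's name for the 64-bit offset format; the grid and variable names follow the declarations above):

   from netCDF4 import Dataset
   import numpy

   nc = Dataset('AerosolIndex.nc', 'w', format='NETCDF3_64BIT')
   nc.Conventions = 'CF-1.0'
   nc.createDimension('time', None)               # unlimited dimension
   nc.createDimension('lat', 180)
   nc.createDimension('lon', 360)                 # assuming a 1-degree grid
   time = nc.createVariable('time', 'i4', ('time',))
   time.standard_name = 'time'
   time.units = 'days since 1978-11-01'
   time.axis = 'T'
   lat = nc.createVariable('lat', 'f4', ('lat',))
   lat.standard_name = 'latitude'
   lat.units = 'degrees_north'
   lat.axis = 'Y'
   lat[:] = numpy.arange(-89.5, 90.0, 1.0)
   lon = nc.createVariable('lon', 'f4', ('lon',))
   lon.standard_name = 'longitude'
   lon.units = 'degrees_east'
   lon.axis = 'X'
   lon[:] = numpy.arange(-179.5, 180.0, 1.0)
   ai = nc.createVariable('AI', 'f4', ('time', 'lat', 'lon'), fill_value=numpy.nan)
   ai.long_name = 'Aerosol Index'
   ai.units = 'fraction'
   nc.close()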

Downloading the Data Files

It's possible to download a data file and append it directly into the NetCDF cube, without storing any temporary files. This approach has the drawback that if anything goes wrong, you have to download everything again. If you later want to do some extra data processing, redoing the whole process is inconvenient due to the long download times. With current disk space prices, it's better to first download the files and store them locally.

The module AI_ftp.py does just that: it stores the files locally and retrieves only new ones.

Make URL from the Template

The programmer can now get the FTP path; the following call returns '/pub/omi/data/aerosol/Y2010/L3_aersl_omi_20100324.txt':

   AI_ftp.determine_ftp_path(datetime.datetime(2010, 3, 24))
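
Such a template function can be as simple as a strftime call. A hypothetical sketch (the real AI_ftp module may build the path differently):

   import datetime

   def determine_ftp_path(dt):
       # hypothetical sketch; a strftime template is enough for this layout
       return dt.strftime('/pub/omi/data/aerosol/Y%Y/L3_aersl_omi_%Y%m%d.txt')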

Download the Text File

Since these are just text files, you could simply download them with your browser. That works well when the dataset is updated at most four times a year, but if the data is updated daily, it's very convenient to have a script you can call at will to download a file.

The module AI_ftp.py covers both needs. If invoked from the command line:

  C:\OWS\web\static\NASA_TOMS>AI_ftp 1978-11-01 2011-07-07

it downloads the files in that time range. With one date, it downloads everything up to today.

Used as a library, it allows downloading a file at will:

   import AI_ftp, datetime
   ftp_conn = AI_ftp.FtpRetriever()
   AI_ftp.download_one_datetime(ftp_conn, datetime.datetime(2010, 2, 28))
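
Under the hood, such a retriever needs only the standard library. A minimal sketch with ftplib, assuming anonymous login and placeholder host and path names (the real FtpRetriever additionally skips files that are already present locally):

   import ftplib

   def download_one(host, remote_path, local_path):
       # minimal standard-library sketch; host and paths are placeholders
       ftp = ftplib.FTP(host)
       ftp.login()                                  # anonymous login
       out = open(local_path, 'wb')
       ftp.retrbinary('RETR ' + remote_path, out.write)
       out.close()
       ftp.quit()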

Compile Data Files into CF-NetCDF

Compile the Text File into a rectangular array

This is the part that requires most of the programming in the whole system. Since the data is in text files, the script needs to read the text, parse the numbers, and assign them into the time slice array.

This is also very file-format specific: you cannot write one text file reading routine that can read any format. So this code cannot be directly applied to anything but these AerosolIndex files; you have to write your own reader for your data.

In many cases, if the data comes in daily NetCDF files, this parsing can be omitted entirely, since reading arrays from NetCDF files is trivial.
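
For example, with the netCDF4 Python library (the file and variable names here are hypothetical):

   from netCDF4 import Dataset

   nc = Dataset('daily_20100324.nc')   # hypothetical daily file name
   grid = nc.variables['AI'][:]        # the whole array in one line
   nc.close()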

The daily slice compilation is done in three steps:

   def parse_grid(filename):
       lines = open(filename).readlines()
       lon_count, lat_count = _get_grid_dimensions(lines)
       lat_lines = _group_by_latitude(lines[3:])
       return _parse_body(lat_lines, lat_count, lon_count)

After reading all the lines from the text file, the size of the grid is determined. The Nimbus and EPTOMS satellite data is exported with a longitude length of 288 (1.25-degree steps), while OMI data has 360 steps (1-degree steps). To read the file correctly, this information cannot be hardcoded.
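
A hypothetical sketch of the dimension reader, assuming the header lines state the bin counts (the exact header wording is an assumption, so check your own files):

   import re

   def _get_grid_dimensions(lines):
       # hypothetical sketch: assumes the second and third header lines
       # contain the bin counts, e.g. "Longitudes:  360 bins ..."
       lon_count = int(re.search(r'(\d+)\s+bins', lines[1]).group(1))
       lat_count = int(re.search(r'(\d+)\s+bins', lines[2]).group(1))
       return lon_count, lat_count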

The data in the file is in rows along latitudes. But since the file lines are limited to 80 characters, one latitude row is spread over several lines. The _group_by_latitude code combines these lines; lat_lines then contains 180 lines, each covering longitudes from -180 to 180 degrees east.

The _parse_body(lat_lines, lat_count, lon_count) function returns a NumPy array. NumPy (Numerical Python) is a Python library for multidimensional arrays; read the Tutorial for more information. Here's the function as a runnable sketch (the exact index arithmetic depends on the file layout):

   import numpy

   def _parse_body(lat_lines, lat_count, lon_count):
       # NaN-filled time slice of shape (1, lat_count, lon_count)
       grid = numpy.empty((1, lat_count, lon_count), dtype=numpy.float32)
       grid.fill(numpy.nan)
       for ilat in xrange(lat_count):
           line = lat_lines[ilat]   # rows assumed to be in the cube's south-to-north order
           for ilon in xrange(lon_count):
               value = line[ilon * 3:(ilon + 1) * 3]   # read 3-character value
               if value != '999':                      # 999 marks missing data
                   grid[0, ilat, ilon] = float(value)
       return grid

Append the time slice

This is again the easy part: since everything is now standardized, the library interface can be easy to use.

   import os, datetime
   import AI_ftp, AI_data
   from datafed import cf1

   def update_range(start_time, end_time):
       nc_cube = cf1.open('AerosolIndex.nc', 'w')
       dt = start_time
       while dt <= end_time:
           update_one_datetime(dt, nc_cube.variables['AI'])
           dt += datetime.timedelta(days=1)

   def update_one_datetime(dt, nc_var):
       filename = AI_ftp.get_local_filename(dt)
       if os.path.exists(filename):
           grid = AI_data.parse_grid(filename)
           nc_var.put_time_slice(grid, dt)

The put_time_slice method first tries to find if the data already exists in the cube. If it does, the data is overwritten. Otherwise a new time index is created and the slice is appended to the data variable.
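
Expressed with plain netCDF4 variables, the method behaves roughly like this sketch (an assumption-laden illustration, reusing the 1978-11-01 epoch from above; the real datafed cf1 method may differ in detail):

   import numpy, datetime

   EPOCH = datetime.datetime(1978, 11, 1)

   def put_time_slice(time_var, data_var, grid, dt):
       # illustration only, not the datafed implementation
       days = (dt - EPOCH).days
       existing = numpy.where(time_var[:] == days)[0]
       if len(existing) > 0:
           itime = existing[0]        # slice already present: overwrite it
       else:
           itime = len(time_var)      # append along the unlimited dimension
           time_var[itime] = days
       data_var[itime, :, :] = grid[0]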

Final Product

links here