Creating NetCDF CF Files

From Earth Science Information Partners (ESIP)

Back to WCS Access to netCDF Files

NetCDF-CF Convention

The reason to use CF convention: Enable plug and play connectivity.

This does not come by simply making good human readable NetCDF files. After all: what could dimension longitude mean besides longitude. If you get data in NetCDF format, it's usually fairly easy to see what really is there. It's also easy to write a generic browser, that can display every variable for you.

But since a lot of data in NetCDF files have geographical meaning, a graphical viewer should be able to draw the data on the map, on it's own. This involves, at minimum:

  • finding the three geographical dimensions
  • Finding time dimension, if any
  • Understanding the geographical projection

From any generic NetCDF, this requires human intelligence. After all, the n-dimensional data variables, dimensional variables and other metadata variables look precisely the same for the program code. There are legion of ways to code projection information, and decoding it reliably is very difficult.

Conventions come to rescue. For example, in CF convention X-axis dimension variable must have an attribute axis='X', and the values are points on X-axis. No guesswork.

The best documentation is at CF Metadata page.

Example showing a NetCDF file as a CDL text

CDL = network Common data form Description Language

Notice, that CDL is convention agnostic, it does not know about CF conventions.

Using ncdump to generate CDL output for a NetCDF-CF file may look like this:

netcdf TOMS_AI_58 {
dimensions:
	time = 1 ;
	lat = 3 ;
	lon = 4 ;
variables:
	double time(time) ;
		time:standard_name = "time" ;
		time:long_name = "time" ;
		time:units = "days since 1979-01-01" ;
		time:axis = "T" ;
	double lat(lat) ;
		lat:standard_name = "latitude" ;
		lat:long_name = "latitude" ;
		lat:units = "degrees_north" ;
		lat:axis = "Y" ;
	double lon(lon) ;
		lon:standard_name = "longitude" ;
		lon:long_name = "longitude" ;
		lon:units = "degrees_east" ;
		lon:axis = "X" ;
	byte AI(time, lat, lon) ;
		AI:long_name = "Aerosol Index" ;
		AI:units = "fraction" ;
		AI:_FillValue = -1b ;
		AI:missing_value = -1b ;

// global attributes:
		:title = "NASA TOMS Project" ;
		:comment = "NASA Total Ozone Mapping Spectrometer Project" ;
		:Conventions = "CF-1.0" ;
data:

 time = 9952 ;

 lat = 32.5, 33.5, 34.5 ;

 lon = -89.375, -88.125, -86.875, -85.625 ;

 AI =
  0, 2, 1, 2,
  _, 2, 3, 2,
  1, 4, 4, 2 ;
}

Let's go over section by section:

Dimensions

dimensions:
	time = 1 ;
	lat = 3 ;
	lon = 4 ;
}

These names can be anything. The reason is, that sometimes you may want to store two grids, that have different dimensions, into the same file. In that case you could name dimensions lat1, lat2 etc...

Time Dimension Variable

	double time(time) ;
		time:standard_name = "time" ;
		time:long_name = "time" ;
		time:units = "days since 1979-01-01" ;
		time:axis = "T" ;

It's the attribute axis = "T" marks this variable as time dimension.

Latitude and Longitude Dimension Variables

	double lat(lat) ;
		lat:standard_name = "latitude" ;
		lat:long_name = "latitude" ;
		lat:units = "degrees_north" ;
		lat:axis = "Y" ;
	double lon(lon) ;
		lon:standard_name = "longitude" ;
		lon:long_name = "longitude" ;
		lon:units = "degrees_east" ;
		lon:axis = "X" ;

Again, it's the attributes axis = "Y" and axis = "X" that mark geographical dimension variables. The linear projection is acknowledged with standard names.

Data Variable

	byte AI(time, lat, lon) ;
		AI:long_name = "Aerosol Index" ;
		AI:units = "fraction" ;
		AI:_FillValue = -1b ;
		AI:missing_value = -1b ;

The dimensions are marked with axis attributes, so this is a data variable. The units = "fraction" is not standard, and therefore the compliance checker reports it as an error.

Global Attributes

		:title = "NASA TOMS Project" ;
		:comment = "NASA Total Ozone Mapping Spectrometer Project" ;
		:Conventions = "CF-1.0" ;

Only convention and version number Conventions = CF-1.0 is required.

Verifying NetCDF-CF Files

Since CF-1.0 conventions contain a lot of definitions, verifying them by machine is necessary.

Online Compliance Checker

There is a fairly complete compliance checker online. They have some NetCDF documentation and CF convention documentation online too. The latest compliance checker is here, it lets you upload a NetCDF file and does a wide range of checks.

Using Compliance Checker from python as a service

TODO: this is just a standard form with http-post, and therefore it should be easy to use it as a service from python. The owsadmin tool should clone a subset of a netcdf cube and submit it.

Using Offline Compliance Checker

TODO Michael Decker has developed a model_checker, that can currently to the following:

find netCDF files that

  • do not have axis=X|Y|Z|T attributes
  • have unknown axis attributes
  • have non-ASCII characters in string attributes
  • define the same axis name multiple times
  • have no units in T-axis
  • have incompatible time format in T-axis

try to repair all of the above problems except for problems concerning the T-axis which can't be fixed automatically

Making this module to work with nc3 just requires some methods into the cf1 module.

Creating NetCDF-CF files

There are a few ways to create a NetCDF file. In general, it's easiest to create the empty file with a descriptive method, and use a programming language to fill in the data.

Use CDL and ncgen

The CDL text you see above is fairly readable, and ncgen can turn it into a NetCDF file.

Use NCML

NCML, NetCDF Markup Language is an XML language for manipulating NetCDF. The designers provide Java implementation, and Datafed supports creating and verifying files.

Creating and Filling NetCDF Files using NCML and python

After installing Windows or Linux you have the datafed.nc3 and datafed.cf1 modules.

Here's a test ncml: CMAQ_Baron_20.ncml

Lets look at it one section at a time, You'll see it's just as simple as CDL

Document Root


<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">

Explicit Declaration

This means, that only metadata in the ncml file is used. The python implementation only supports explicit, so it's mandatory.


	<explicit />

Global Attributes

For the WCS, only Conventions is required. You can put any attribute into the NetCDF. Valid attribute types are: byte, short, int, long, float, double, String, string and Structure. The two capitalization cases of string is probably a historical accident and datafed does not support Structure.


	<attribute name="title" type="string" value="CMAQ_Baron" />
	<attribute name="comment" type="string" value="Concentration file output From CMAQ model dyn alloc version CTM" />
	<attribute name="Conventions" type="string" value="CF-1.0" />

Grid Dimensions

The spatial grid has size 220 * 288.

The grids are to be appended one time unit at a time, so the time dimension is defined unlimited, being initially 0.


	<dimension name="time" length="0" isUnlimited="true" />
	<dimension name="lat" length="220" />
	<dimension name="lon" length="288" />


Time Dimension Variable, integer meaning hours from start

In a 3D (time, latitude, longitude), every dimensions needs a description variable. The NetCDF variable is read by indexes, and dimension variables give the physical coordinate at that index. The unit of this is hours since 2009-01-01, so 0 means 2009-01-01T00:00:00 and 36 means 2009-01-02T12:00:00.

The axis=T attribute makes this to be the time dimension variable. The fact that dimension name is time or that variable name is time does not mean anything. They can be named as you wish, but axis is mandatory.

Datafed supports following time units: years, months, days, hours, minutes or seconds


	<variable name="time" type="int" shape="time">
		<attribute name="standard_name" type="string" value="time" />
		<attribute name="long_name" type="string" value="time" />
		<attribute name="units" type="string" value="hours since 2009-01-01" />
		<attribute name="axis" type="string" value="T" />
	</variable>

This variable contains no data initially, the python utility method put_time_slice fills it automatically.

Latitude and Longitude Dimension Variables with Data

Very similar to time variable, but these dimensions are fixed, so they must be filled with data.

The node <values start="6.2" increment="0.05" /> fills the array with 6.2, 6.25, 6.3 etc...

Another method is to enumerate all the values like <values>6.2, 6.25, 6.3 6.35<values/>

This is useful if the dimension is not linear, like elevation dimension by atmospheric pressure often is.

As usual, axis attribute marks the geographical dimensions. names lat and lon are not significant.


	<variable name="lat" type="double" shape="lat">
		<attribute name="standard_name" type="string" value="latitude" />
		<attribute name="long_name" type="string" value="latitude" />
		<attribute name="units" type="string" value="degrees_north" />
		<attribute name="axis" type="string" value="Y" />
		<values start="6.2" increment="0.05" />
	</variable>
	<variable name="lon" type="double" shape="lon">
		<attribute name="standard_name" type="string" value="longitude" />
		<attribute name="long_name" type="string" value="longitude" />
		<attribute name="units" type="string" value="degrees_east" />
		<attribute name="axis" type="string" value="X" />
		<values start="-91.6" increment="0.05" />
	</variable>

Data Variables

A variable without axis attribute is a data variable.

Since they have time dimension, they are initially empty so there's no data. But attributes _FillValue and missing_value are important.

The attribute _FillValue has a special meaning deep in the NetCDF library. Since the two variables share the same time dimension, if you append data into one, the other array must be written too. In this case, the NetCDF library writes it with the _FillValue.

Datafed supports _FillValue NaN for floats and doubles.

In principle missing_value could be different from _FillValue, but in practice they should be the same.


	<variable name="PM2_5" type="float" shape="time lat lon">
		<attribute name="long_name" type="string" value="PM2.5 concentration" />
		<attribute name="units" type="string" value="micrograms/m^3" />
		<attribute name="_FillValue" type="float" value="NaN" />
		<attribute name="missing_value" type="float" value="NaN" />
	</variable>
	<variable name="O3" type="float" shape="time lat lon">
		<attribute name="long_name" type="string" value="Ozone" />
		<attribute name="units" type="string" value="ppmV" />
		<attribute name="_FillValue" type="float" value="NaN" />
		<attribute name="missing_value" type="float" value="NaN" />
	</variable>

Important Information about _FillValue!

The type of _FillValue must be the same as the type of the variable itself. If not, writing may fail with very odd error messages.

Closing Tag is required


</netcdf>

Creating NetCDF Files using Python

Using the NCML file above, you can create the NetCDF file with a few lines of python:

The CMAQ_Baron_20.ncml file must exist, and a completely new CMAQ_Baron_20.nc file will be created.

import datafed
from datafed import cf1

cf1.create_ncml22("CMAQ_Baron_20.nc", "CMAQ_Baron_20.ncml")

This creates the dimensions and dimension variables, and leaves the data empty.

Now we can add a spacial slice of a given time:

import datafed
from datafed import cf1
import datetime

nc = cf1.open("CMAQ_Baron_20.nc", "w")

dt = datetime.datetime(2009, 1, 2)

# create array for one time slice
# 220 latitudes, 288 longitudes, NaN as initial value
data = cf1.create_array([220, 288], float("NaN")) 
for line in open("data.txt"): 
    #data.txt must have lines
    #x-index, y-index, value
    #
    #0, 0, 5.4
    #0, 1, 7.6
    #etc...

    x, y, v = line.split(",")
    data[int(y)][int(x)] = float(v)

nc.put_time_slice("PM2_5", slice, dt)

This will write the data array into PM2_5 variable, and also write "24" into the datetime variable, since the time is 24 hours from the time dimension start 2009-01-01.

Notice, that put_time_slice only supports updating or appending, inserting into the middle is not possible.