Difference between revisions of "Talk:WCS Access to netCDF Files"

Latest revision as of 04:46, July 17, 2009

-- Rhusar 06:46, 17 July 2009 (EDT)

Hello,

At the galeon team wiki site:

http://sites.google.com/site/galeonteam/Home/plan-for-cf-netcdf-encoding-standard

I put together a rough draft outline of a plan for establishing CF-netCDF as an OGC binary encoding standard. Please note that this is a strawman. Comments, suggestions, complaints, etc. are very welcome and very much encouraged. It would be good to have the plan and a draft candidate standard for the core in pretty solid shape by early September -- 3 weeks before the next OGC TC meeting which starts on September 28.

One issue that requires airing early on is the copyright for any resulting OGC specification documents. Carl Reed, the OGC TC chair, indicates that the wording normally used in such documents is:

   Copyright © 2009, <name(s) of organizations here>
   The companies listed above have granted the Open Geospatial Consortium, Inc. (OGC) a nonexclusive, royalty-free, paid up, worldwide license to copy and distribute this document and to modify this document and distribute copies of the modified version.

I'm sending a copy of this to our UCAR legal counsel to make sure we are not turning over ownership and control of CF-netCDF itself.

-- Ben

Hi Ben,

Firstly -- applause, applause! This is an important step. Thanks so much for leading it.

If it is not too late, however, I'd like to open a discussion on a rather significant change in the approach. As outlined at the URL you provided, the approach focuses on "CF-netCDF as an OGC binary encoding standard". Wouldn't our outcomes be more powerful and visionary if instead we focussed on the netCDF API as an OGC standard? Already today we see great volumes of GRIB-formatted data that are served as if they were netCDF through OPeNDAP -- an illustration of how the API as a remote service becomes a bridge for interoperability. The vital functionalities of aggregation and of augmentation via NcML are about exposing *virtual* files -- again, exposing the API, rather than the binary encoding.

It is in the ability to access remote subsets of a large virtual netCDF dataset that we see the greatest power of netCDF as a web service. While this can be implemented as a "fileout" service (the binary encoding standard approach) -- and that has been done successfully in WCS and elsewhere -- it does not seem like the optimal strategy. It is the direct connection between data and applications (or intermediate services) -- i.e. the disappearance of the "physical" (binary) file -- which seems like the service-oriented vision. This would not eliminate the ability of the standard to deliver binary netCDF files in the many cases where that is the desired result. Simple REST fileout services are desirable and should perhaps be included as well in this standards package.

David Artur (OGC representative) indicated at the meeting where we met with him in May that there were other examples of standardizing APIs within OGC. He also mentioned that with a community-proven interoperability standard the OGC process can be relatively forgiving and streamlined (fingers crossed ... let's hope). As I understand it, the most recent documents from GALEON allow for an OPeNDAP URL as the payload of WCS. So the concept of the API standard -- the reference to the file, rather than the binary file itself -- has already made its way into the GALEON work, too. I imagine there have already been discussions about this point. Very interested to hear your and others' thoughts.

   - Steve

Hi,

I think one needs to standardize BOTH – an access API and an encoding, AND to do this in a way that they work with one another. It is for this reason (as an example) that GML exposes the source data model (as well as acting as the data encoding for transport) so that WFS can define requests in a neutral manner. It should NOT be a matter of ONE or the OTHER. You might also look at the work of the XQuery Data Model group.

R

Ron, I am unfamiliar with GML, and I am not sure I understand what you are saying. I think of the encoding for _transport_ as a very different thing than an encoding for files. If I am not mistaken, the netCDF API provides an encoding for transport also. No?

Which brings me to Steve's email, with which I agree in broad terms. One thing that CF has that is not explicit/required in the netCDF API definition is at least the possibility of providing one standard name for each variable. (More would be better, but one step at a time....) I am sure this information makes it across the API when it is provided, but to be honest, in this day and age spending a lot of time standardizing the API, while remaining quiet about the semantics of the transported information, does not seem cost-effective. I think there might be some easy strategies for bridging that gap (mostly by insisting on CF-compliant data on the far side of the interface).
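As a rough illustration of "insisting on CF-compliant data on the far side of the interface", a service could run a check along the following lines before exposing a dataset. This is only a sketch: the dict-of-attributes representation and the function name `missing_standard_names` are assumptions made for illustration, not part of any netCDF or CF library.

```python
# Hypothetical representation: each variable is a plain dict of its attributes.
def missing_standard_names(variables):
    """Return the names of variables lacking a non-empty CF standard_name."""
    return sorted(
        name
        for name, attrs in variables.items()
        if not str(attrs.get("standard_name", "")).strip()
    )


example = {
    "temp": {"standard_name": "sea_water_temperature", "units": "K"},
    "sal": {"units": "1e-3"},  # no standard_name, so it is reported
}
print(missing_standard_names(example))
```

A server could refuse, or merely flag, datasets for which this list is non-empty; which policy is appropriate is a separate question.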

John

Ron Lake

OK – perhaps I misspoke – I am not that familiar with the NetCDF API – often an API defines just Request/Response and uses something else for the transport – that is the case for GML/WFS or GML/WCS. I have just often observed in OGC a conflict between encodings and APIs when we should focus on the two together. Sometimes the API folks want to enable many transport encodings and the encoding people want to support many request/response APIs, etc.

Jon Blower

Hi Ron, all,

I think it's confusing to talk about "the NetCDF API", because in reality there are lots of APIs at work in reading data using what might loosely be called "NetCDF technologies". So when we talk about "standardizing NetCDF APIs through OGC" we could be talking about several different things:

1) Standardizing the NetCDF data model as a means of structuring array-based information (this could be an implementation of a Coverage; in fact, Bryce Nordgren has compared the NetCDF data model with ISO 19123 Coverages). The data model describes a kind of language-independent API. Importantly, lots of file formats can be modelled using the NetCDF data model.

2) Standardizing the NetCDF file format as a means of encoding data on disk. There are APIs in many languages for reading this format.

3) Standardizing the Climate and Forecast metadata conventions as a means of georeferencing the arrays and adding semantics. The interpretation of these conventions requires another API.

4) Standardizing the Data Access Protocol as a request-response mechanism for getting data using web services. The request-response mechanism is another API.

In the NetCDF community, we are very accustomed to simply using the second type of API in our programs, with the rest of the APIs being handled transparently behind the scenes in our tools.

The following expansion is intended for those who are unfamiliar with NetCDF technologies - Unidata guys can go to sleep now!

Very briefly, the NetCDF data model considers Datasets, which contain Variables (temperature, salinity etc), which contain Arrays of data. There are structures for holding coordinate systems for the data in the Arrays. Georeferencing is achieved through the use of attributes, whose names are standardized in the Climate and Forecast (CF) conventions.
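The structure described above can be sketched in a few lines of plain Python. This is a toy model for readers new to netCDF, not the netCDF library API: the class names `Dataset` and `Variable` and the helper `horizontal_coordinates` are illustrative assumptions. The `units` values `degrees_north` and `degrees_east` are CF conventions for latitude and longitude; CF also accepts other spellings, which this simplified check ignores.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    name: str
    dimensions: tuple  # names of the dimensions, in order
    data: list         # nested lists standing in for an N-d array
    attributes: dict = field(default_factory=dict)


@dataclass
class Dataset:
    dimensions: dict   # dimension name -> length
    variables: dict = field(default_factory=dict)
    attributes: dict = field(default_factory=dict)

    def add(self, var):
        self.variables[var.name] = var


def horizontal_coordinates(ds):
    """Locate latitude/longitude variables from their CF 'units' attribute."""
    found = {}
    for var in ds.variables.values():
        units = var.attributes.get("units")
        if units == "degrees_north":
            found["latitude"] = var.name
        elif units == "degrees_east":
            found["longitude"] = var.name
    return found


ds = Dataset(dimensions={"lat": 2, "lon": 3}, attributes={"Conventions": "CF-1.4"})
ds.add(Variable("lat", ("lat",), [10.0, 20.0], {"units": "degrees_north"}))
ds.add(Variable("lon", ("lon",), [0.0, 5.0, 10.0], {"units": "degrees_east"}))
ds.add(Variable(
    "temp", ("lat", "lon"),
    [[280.0, 281.0, 282.0], [283.0, 284.0, 285.0]],
    {"standard_name": "air_temperature", "units": "K"},
))
print(horizontal_coordinates(ds))
```

The point of the sketch is that the georeferencing lives entirely in attributes: nothing in `Dataset` or `Variable` knows about geography.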

In terms of data transport, we always have the possibility to just transfer NetCDF files from place to place. However, Steve hit the nail on the head when he said:

> It is the direct connection between data and applications (or intermediate services) -- i.e. the disappearance of the "physical" (binary) file -- which seems like the service-oriented vision.

We can create "virtual" datasets, then expose them through the Data Access Protocol. The data model of the DAP is very close to that of NetCDF, so data transport on the wire is very nearly lossless. The client can get a handle to a Variable object, which might actually reside physically on a remote server, and whose data might actually be spread across different files. It's extremely powerful and useful. (It's even more powerful when you consider that the NetCDF data model can be applied to many different file formats such as GRIB, the WMO standard. This means that the "NetCDF Variable" in question might actually be a virtual variable consisting of a thousand individual GRIB files.)
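To make the idea of a virtual variable concrete, here is a minimal sketch of one logical array whose leading axis (say, time) is spread across several sources. It assumes each source can be read by a zero-argument loader function; the class name `VirtualVariable` and the file names are hypothetical, and real aggregation tools handle far more (multiple dimensions, caching, metadata merging).

```python
import bisect


class VirtualVariable:
    """One logical variable whose leading axis spans several sources."""

    def __init__(self, chunks):
        # chunks: list of (length, loader) pairs; loader() returns a list
        # holding that source's records along the leading axis.
        self._loaders = [loader for _, loader in chunks]
        self._starts = []
        total = 0
        for length, _ in chunks:
            self._starts.append(total)
            total += length
        self._length = total

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if not 0 <= index < self._length:
            raise IndexError(index)
        # Find the source whose index range contains the requested record.
        chunk = bisect.bisect_right(self._starts, index) - 1
        return self._loaders[chunk]()[index - self._starts[chunk]]


# Stand-ins for two files on disk (hypothetical names and contents).
files = {"a.grb": [1.0, 2.0], "b.grb": [3.0, 4.0, 5.0]}
var = VirtualVariable([(len(v), (lambda v=v: v)) for v in files.values()])
print(len(var), var[3])
```

The client indexes `var` as if it were a single array and never learns which file supplied a given record.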

One key difference between this method and GML/WFS is that the DAP protocol knows nothing about geographic information: this information is carried in the (CF-compliant) attributes, which require interpretation by an intelligent client. Also, the data are transported as arrays in compressed binary format so there's little chance of a human being able to interpret the data stream on the wire.

However, this allows the efficient transport of large data volumes.

The opaqueness of the DAP is handled through the use of tools: humans hardly ever construct DAP requests manually.
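For those curious what such a tool assembles on the user's behalf, the sketch below builds a DAP2-style request URL with an array hyperslab constraint of the form `[start:stride:stop]` per dimension, with `stop` inclusive. The function name `dap_subset_url` and the server URL are assumptions for illustration, and a real client would also need to percent-encode the brackets where the HTTP library requires it.

```python
def dap_subset_url(dataset_url, variable, slices, suffix="dods"):
    """Build a DAP2-style URL requesting a hyperslab of one variable.

    slices: one (start, stride, stop) tuple per dimension; stop is inclusive.
    suffix: response type, e.g. "dods" for the binary data response.
    """
    for start, stride, stop in slices:
        if start < 0 or stride < 1 or stop < start:
            raise ValueError(f"invalid hyperslab {(start, stride, stop)}")
    hyperslab = "".join(f"[{a}:{b}:{c}]" for a, b, c in slices)
    return f"{dataset_url}.{suffix}?{variable}{hyperslab}"


# Hypothetical server and dataset; first time step, a lat/lon window.
url = dap_subset_url(
    "http://example.org/dap/ocean.nc",
    "temp",
    [(0, 1, 0), (10, 1, 20), (30, 2, 40)],
)
print(url)
```

Note that nothing in the request mentions latitude, longitude, or time: the client must first read the CF attributes to work out which array indices correspond to the region it wants.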

Hope this helps, Jon

Gerry Creager

Jon does a good job of identifying the issues and explaining common usage, something I'd identified on the to-do list but had not approached yet. I agree with him on all points below.

I remain concerned, though, that by pressing ahead with this without identifying a method for CF to address irregular grids and better address point coverages in NetCDF, we will create a situation where we ratify a new standard and then immediately have to revisit many of the same issues from a different viewpoint. And understand: this isn't the pattern of creating a spec and then empaneling an RWG immediately, but rather sweeping the difficult part (irregular grid coverages, common in ocean/coastal/marine applications) under the rug while we ratify what we already know works well. I would like to see that hard part addressed as an element from the start. The fundamental data model needs to address this issue, and there is little point in ratifying what is effectively a standard already through common use (note Jon's comment on how NetCDF is used in the community already) when we know we will have to redo the whole exercise almost immediately.

gerry

Ben Domenico

Hi all,

This is a fascinating conversation -- perhaps too fascinating. I'd really like to break it out into components, which is what I was trying to do by proposing a core-and-extensions approach to standardization. In my plan we start with a core standard. In the NASA version, 14 pages sufficed. It contains all the information needed to understand a "netCDF object." I like that.

http://www.esdswg.org/spg/rfc/esds-rfc-011/ESDS-RFC-011v1.00.pdf

NetCDF core standard plus extensions for each CF convention

I also propose we develop the first (of perhaps several) extensions. That extension will describe the CF-conventions for gridded data which are mature and in wide use.

   http://cf-pcmdi.llnl.gov/documents/cf-conventions

   http://cf-pcmdi.llnl.gov/documents/cf-conventions/1.4/cf-conventions.html#grid-mappings-and-projections

The CF conventions are essential for understanding the "what, where, when" semantics of a netCDF object. In the future I envision extensions for additional data types -- as Gerry suggests. The recently proposed CF conventions for point/station data would be an obvious next step. But in my plan, it would be a next step. We would not try to do this all at once.

API extensions to the core?

As to the API, Jon Blower rightly points out that there are many netCDF APIs. Perhaps at some point, we could also develop extensions for each API. On the other hand maybe some aspects of this standardization can be accomplished by pointing to the netCDF documentation which is carefully written and consistently maintained. I believe the KML spec does this, but I have to look into examples of API standardization more carefully.

The netCDF data model

The basic elements of the netCDF data model are described in the NASA expression of the standard. In addition, there is a formal publication, for which Stefano was the lead author, that maps the netCDF Common Data Model to the ISO 19123 data model. I'll try to dig up the URL for the online version of this publication.

Bottom line

This discussion reinforces my conviction that we have to take a stepwise approach to this, starting with the core standard, where we already have something in place with NASA, and at least one extension, namely the CF conventions for gridded data. As we work on these, we can continue the discussion of additional extensions for different APIs and different CF conventions.

I would like to have the core standard in place within 6 months and the first extension shortly after that. At that point we will have discussed these other possible extensions and have a plan in place for making them happen as we come to agreement on them.