Talk:Streaming and/or netCDF File

From Federation of Earth Science Information Partners

Streaming netCDF -- MDecker 06:40, 14 July 2011 (MDT)

Is it possible to stream netCDF output files to the client on the fly, without the need to save them locally on disk first? This would eliminate a disk I/O bottleneck and also ease possible restrictions on the maximum size of requested datasets (it would at least be useful for the store=false mode of operation). In this context we should also explore whether there are any benefits from using OPeNDAP to some extent. Another idea we had was to find some way to reliably estimate (or predict) the total dataset size before delivery (and possibly before assembly of the result on the server). This would be useful on the server side, to reject queries that would be too big, as well as on the user side, to give an idea of likely download durations.

Re: Streaming netCDF -- Hoijarvi 06:42, 14 July 2011 (MDT)

Streaming huge files is beneficial because you can start writing output before reading the data has finished. Todd Plessel at EPA uses streaming, I've heard. The Unidata netCDF library does not support it directly, but if you restrict the writing order (attributes first, then variables in the defined order), it might be possible to add streaming support to the netCDF library. We should ask the Unidata netCDF developers. About the I/O bottleneck: modern file caching, especially on Windows when you open the file with the temporary bit on, is quite fast. We definitely need to benchmark the I/O speed first.

Re: Re: Streaming netCDF -- MDecker 06:51, 14 July 2011 (MDT)

That sounds interesting; I would love to hear more about it. Streaming on the fly could improve the server's response times a lot.
The Unidata NetCDF library does not support it directly, but if you restrict the writing order: attributes first, then variables in the defined order, it might be possible to add streaming support to the netcdf library. We should ask unidata netcdf developers.
Yes, I have been considering something like that, but without explicit support from the low-level netCDF libs you can never be quite sure whether it will work in all cases. The netcdf4-python module offers a sync/flush call that would probably help in hacking this if there is no reliable official way to do it.

Re: Re: Streaming netCDF -- MSchulz 06:56, 14 July 2011 (MDT)

Concerning the streaming idea: it would indeed be wonderful to hear from Todd, Ben or Ethan how it is done/can be done or should be done.

Re: Re: Streaming netCDF -- TPlessel 12:56, 14 July 2011 (MDT)

The concept is for data to be streamed over a socket from one process (often on a remote computer) into another process, without the need to first write it to a transient disk file (on the local computer).
Such socket I/O (with appropriately configured buffer sizes [Note 1 at end]) is considerably faster than disk writes/reads.
(I don't have my I/O benchmarks handy, but I was astonished to learn long ago that, for large files, ftp-localhost-put was faster than cp.) For such streaming to be maximally efficient, it is important that the data format be 'predictively parsable'.
This means that each read is for a pre-determined/expected amount of data, either fixed-size or previously read; no temporary buffer guessing. For example, suppose the input to some program is a variable-length list of numbers.

A non-predictively-parsable format relies on some sentinel (EOF, ", </end>, etc.) to terminate the list, e.g.:

    <list> 1.1 2.2 3.3 4.4 5.5 </list>

This is inefficient, since the reading program must guess/allocate/reallocate+copy some buffer as it reads the data (assume an arbitrary-length data line or number of data lines between the delimiters). Even using convenient language features such as

    std::vector<double> numbers;
    while (...) { ...; numbers.push_back( value ); }

only hides the inefficiencies (over-allocating, re-allocating and copying, or, with linked lists, the extra memory used for the pointers).
Whereas a predictively-parsable format such as:

    5 1.1 2.2 3.3 4.4 5.5

is both straightforward and more efficient (and common since FORTRAN days), allowing the program to first read the length of the list (and allocate exactly the needed memory to hold it, once), then read exactly the expected number of values.
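The count-prefixed pattern can be sketched in a few lines of C. The function name is made up for illustration; this is not code from any of the systems discussed, just the read-length-then-allocate-once idea:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

/* Read a count-prefixed (predictively parsable) list of doubles, as in
 * "5 1.1 2.2 3.3 4.4 5.5": read the length first, allocate exactly once,
 * then read exactly that many values. Returns the array (caller frees)
 * and sets *out_n, or returns NULL on malformed input. */
double *read_predictive_list(FILE *in, size_t *out_n) {
    size_t n = 0;
    if (fscanf(in, "%zu", &n) != 1) return NULL;
    double *values = malloc(n * sizeof *values);
    if (n > 0 && !values) return NULL;
    for (size_t i = 0; i < n; ++i) {
        if (fscanf(in, "%lf", &values[i]) != 1) {
            free(values);               /* sentinel-free: a short read is an error */
            return NULL;
        }
    }
    *out_n = n;
    return values;
}
```

No reallocation or copying ever happens: the single malloc is sized from the already-read count, which is the whole point of the format.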
An example of a decades-old streamable data format is the ESRI Spatial Database Engine (SDE) C API, which allows Shapefile-format data to be streamed over sockets between processes (possibly on different computers).

Another example is RSIG's 'FORMAT=XDR'

http://badger.epa.gov/rsig/webscripts.html#output which has a fixed-size (predictively parsable) ASCII header followed by portable XDR (IEEE, big-endian) format binary data arrays.
Here is an example for GASP-AOD satellite data (to pick a non-trivial case):

    GASP 1.0
    http://www.ssd.noaa.gov/PS/FIRE/GASP/gasp.html
    2008-06-21T00:00:00-0000
    # Dimensions: variables timesteps scans:
    3 24 14
    # Variable names:
    longitude latitude aod
    # Variable units:
    deg deg -
    # Domain: <min_lon> <min_lat> <max_lon> <max_lat>
    -76 34 -74 36
    # MSB 64-bit integers (yyyydddhhmm) timestamps[scans] and
    # MSB 64-bit integers points[scans] and
    # IEEE-754 64-bit reals data_1[variables][points_1] ... data_S[variables][points_S]
    <binary data arrays described above are here...>
Notes:

  1. The ASCII header is always 14 lines long.
  2. There are always 3 dimensions (line 5).
  3. The integers points[scans] (line 13) tell how many data points there are per scan.
  4. The reals data_1..S are the data, per variable and scan.

Having read (and summed) the points[], the size of all data arrays is known before reading them. In general, before reading any variable-length data, enough information has already been read to know how to read the next piece of data. So the format is predictively parsable, efficient, human-readable and straightforward. It is so simple that it does not require any format-specific API to read.
This no-API-required quality is an important point for supporting the web-services philosophy of portable, platform/language/technology-independence. (Also, often fewer lines of native I/O code are needed to read the above format than a comparable file (below) using the NetCDF API.)
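As a sketch of the no-API-required point (assuming, per the notes, that the header really is always 14 ASCII lines with the three dimensions on line 5), a native C reader needs nothing beyond stdio. The function name and error handling are ours, not RSIG code:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Read the fixed 14-line GASP-AOD ASCII header: because the header length
 * is fixed and line 5 always holds the dimensions, every read is for an
 * expected amount of data. Returns 0 on success, -1 on a truncated or
 * malformed header. */
int read_gasp_header(FILE *in, int *variables, int *timesteps, int *scans) {
    char line[256];
    for (int i = 1; i <= 14; ++i) {
        if (!fgets(line, sizeof line, in)) return -1;   /* header truncated */
        if (i == 5 &&
            sscanf(line, "%d %d %d", variables, timesteps, scans) != 3)
            return -1;                                  /* bad dimensions line */
    }
    return 0;   /* binary timestamps[], points[], and data arrays follow */
}
```

After this returns, the caller would read the binary timestamps[scans] and points[scans] arrays, sum points[], and then know the exact size of every data array before reading it.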
Examples of non-streamable formats include NetCDF and HDF, since both require format-specific (and language-specific) API libraries to read the data and, per those APIs, the data must be in files.
E.g., nc_open() takes a file name rather than a FILE*.
(A FILE* can be obtained from pipes via popen() and even from sockets via fdopen(socket()).)
Perhaps this could be remedied in future versions of these libraries.
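To illustrate why a FILE*-based entry point would matter: stdio already abstracts over pipes and sockets, so a reader that accepted a FILE* could consume streamed data unchanged. A minimal sketch (the helper name is made up; it just shows a FILE* backed by a pipe rather than a disk file):

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Run a command and read the first line of its output through a FILE*
 * obtained from popen(). Any reader written against FILE* (instead of a
 * file name, as nc_open() requires) works identically on pipes, sockets
 * via fdopen(socket()), and ordinary files. Returns 0 on success. */
int read_line_from_pipe(const char *command, char *buf, size_t len) {
    FILE *in = popen(command, "r");   /* FILE* backed by a pipe, not a disk file */
    if (!in) return -1;
    int ok = fgets(buf, (int)len, in) != NULL;
    pclose(in);
    return ok ? 0 : -1;
}
```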
Here is an ASCII dump of GASP-AOD in NetCDF-COARDS format:

    $ ncdump gasp1.nc
    netcdf gasp1 {
    dimensions:
        points = 802 ;
    variables:
        float longitude(points) ;
            longitude:units = "degrees_east" ;
        float latitude(points) ;
            latitude:units = "degrees_north" ;
        float aod(points) ;
            aod:units = "none" ;
            aod:missing_value = -9999.f ;
        int yyyyddd(points) ;
            yyyyddd:units = "date" ;
        int hhmmss(points) ;
            hhmmss:units = "time" ;
        float time(points) ;
            time:units = "hours since 2008-06-21 11:15:00.0 -00:00" ;

    // global attributes:
        :west_bound = -76.f ;
        :east_bound = -74.f ;
        :south_bound = 34.f ;
        :north_bound = 36.f ;
        :Conventions = "COARDS" ;
        :history = "http://www.ssd.noaa.gov/PS/FIRE/GASP/gasp.html,XDRConvert" ;
    data:
        longitude = -74.00761, -74.03289, -74.00693, -74.03222, -74.00624, ...
Is the above example data format predictively parsable?
Not unless it is understood that there is only ever one dimension, called points, and there are only ever 6 variables, called longitude, ..., aod, ..., time. In general, the ASCII dump of an arbitrary NetCDF file is not predictively parsable. However, the NetCDF API does allow querying the number of dimensions before reading them, the number and types of variables before reading them, etc.
So, given reliance on an API, such data can be made predictively parsable and thus efficiently streamable. It is up to Russ Rew and staff to consider the devilish implementation details.

So to summarize: for large scientific datasets, I prefer simple, predictively-parsable, straightforward, portable, platform/language/technology-independent (no-API-required), efficient, high-performance, binary streamable data formats. I realize that NetCDF, HDF, etc. are not going away anytime soon, so we (data consumers) must grapple with their complexities and inefficiencies.
But at least, if/when we create new data formats, consider the above qualities. (Years ago, I developed a comprehensive Field Data Model with the above qualities and much more, but never finished it.) Finally, as I've said before, universal interoperability requires a single comprehensive (non-federated) data model, not just use of general-purpose data formats such as NetCDF, HDF, etc.
Conventions such as COARDS and CF are a step in the right direction, but far from a comprehensive data model.

[Note 1] Regarding I/O buffer sizes:

For sockets these can (and should) be portably set to 256 KB (using setsockopt( theSocket, ... ) with SO_SNDBUF and SO_RCVBUF set to 256 * 1024), which results in faster performance than the puny default sizes.
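The buffer setting described above might be applied like this (a sketch; the helper name is ours and error handling is kept minimal):

```c
#include <sys/socket.h>

/* Enlarge a socket's send and receive buffers to 256 KB via SO_SNDBUF and
 * SO_RCVBUF, as suggested in the note. Returns 0 on success, -1 on error.
 * Note the kernel may round or cap the value (Linux, for instance, reports
 * back double the requested size). */
int set_socket_buffers(int sock) {
    int size = 256 * 1024;
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof size) != 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, sizeof size) != 0)
        return -1;
    return 0;
}
```

Calling this once on each end of the connection, right after socket()/accept() and before any transfer, is what makes the large-buffer speedup available.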

File I/O buffer sizes are not as easy to set portably. However, I discovered that NetCDF has its own internal copy buffers, and they can be increased to 256 KB by changing nc_open() in nc.c: size_t chunk_size_hint = 256 * 1024;
This yields improved performance, but is still much slower than direct fread/fwrite calls.
(I also replaced the swapn4b byte-swapping routine with a pipelined, optionally parallel implementation which also boosted NetCDF performance.)

Re: Re: Re: Streaming netCDF -- MSchultz 07:39, 14 July 2011 (MDT)

Thanks, Todd - this was very informative. Just to take it one step further: assume we want to stick with netCDF for a while (simply because it is widely used and we already have something working based on this format). Then what exactly would be required from the current (or a new) netCDF version to allow for (efficient) streaming? From your mail, I take the following points:

  1. change the nc_open API or add a new routine "nc_fopen" which accepts FILE* arguments as well as file names. "E.g., nc_open() takes a file name rather than a FILE*. (FILE* is returned from pipes popen() and even sockets via fdopen(socket()).) Perhaps this could be remedied in future versions of these libraries."
  2. make sure that data sizes can be efficiently obtained before the actual data are read. If I understand correctly, then this has two major ramifications:
  3. rather than looping through individual dimensions and variables, it might be more efficient if an API command existed which would return the entire header in a predefined structure.

Naturally this would come with limitations and some memory overhead, but perhaps it would be possible to define a structure that is good for 99.9% of all netcdf files. This could be some record structure like (pseudo code):

    dims[32]           # personally I never saw a netcdf file with more than 8 dimensions
       { long(len)
         char[80] name }
    vars[256]          # again, in my experience 100 takes it very far already...
       { vardims[8]    # indices to dims
         char[80] name
         char[1024] attributes } # the format of this would need to be defined
    char[1024] global_attr

A structure like this would have a predefined size of 290K (with a little tweaking one should be able to fit this into a 256K block) and you would then have all the size information available that you need to stream the actual data (= variable content). Of course some provision must exist for (the probably very rare) cases where the dimension limits of this predefined structure are exceeded. In this case some error code would force the system to revert to normal API use or simply to fail. In the case of attributes one can perhaps establish a "preference list" so that important attributes like units are always included and if the buffer length is exceeded the rest will simply be cut off. One would still have the chance to go back and get the rest via the traditional get_attribute way. ...perhaps one would wish to include an "int state" flag for each variable to indicate if the given information is complete?
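Rendered as C structs (purely hypothetical, using the guessed limits above: 32 dimensions, 256 variables, 80-character names, 1 KB attribute buffers), the record might look like this; the point is that its size is fixed and known before any read:

```c
#include <stddef.h>

/* Hypothetical fixed-size header record mirroring the pseudo-code sketch;
 * field sizes are the discussion's guesses, not an actual netCDF structure. */
typedef struct {
    long len;               /* dimension length */
    char name[80];          /* dimension name */
} dim_entry;

typedef struct {
    int  vardims[8];        /* indices into dims[] */
    char name[80];          /* variable name */
    char attributes[1024];  /* packed attribute text; format to be defined */
} var_entry;

typedef struct {
    dim_entry dims[32];
    var_entry vars[256];
    char      global_attr[1024];
} fixed_header;
```

On a typical LP64 platform sizeof(fixed_header) comes out around 288 KB, consistent with the ~290K estimate above; trimming the attribute buffers slightly would fit it into a 256K block.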

A more difficult (?) problem is that of slicing. This is at the heart of protocols such as WCS: you can define your own domain, and data should be extracted accordingly. The netCDF API allows slicing based on start:offset:stride indexing. Given these values you naturally know the dimensionality of your subset, and you can again predefine your read buffer. But how do you get from a domain in coordinate values (including time) to the domain in indices? This requires knowledge of the coordinate values... Maybe I am just not thinking it through to the end yet, but does this represent a problem for streaming or not?
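The coordinate-to-index mapping itself is straightforward once the coordinate values are in hand; a sketch for one axis, assuming a monotonically increasing coordinate variable (the helper is hypothetical, and a linear scan is used for clarity):

```c
#include <stddef.h>

/* Map a domain given in coordinate values [lo, hi] onto index space for
 * one axis: find the first index with coord >= lo and the last index with
 * coord <= hi, filling *start and *count in the netCDF start/count style.
 * Returns 0 on success, -1 if the domain does not intersect the data.
 * The catch raised above remains: the coordinate values themselves must
 * be read before this can run. */
int domain_to_indices(const double *coord, size_t n,
                      double lo, double hi,
                      size_t *start, size_t *count) {
    size_t first = 0, last = n;
    while (first < n && coord[first] < lo) ++first;
    while (last > first && coord[last - 1] > hi) --last;
    if (first >= last) return -1;   /* empty intersection */
    *start = first;
    *count = last - first;
    return 0;
}
```

So the question reduces to whether the coordinate variables can be delivered (or cached) ahead of the data variables in a streamed layout.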

Are there any other issues specific to netcdf to make it (more) streamable?

Re: Re: Re: Streaming netCDF -- MDecker 07:45, 14 July 2011 (MDT)

Hi Todd, thanks for your response and input. Maybe the context did not become clear enough from that one mail you received alone. My idea of what we need for "netCDF streaming" does not take it as far as you suggested, so it might be easier than what you pointed out (but still difficult enough). Essentially, what I would like to have is simply a way to create a valid netCDF data stream on the fly on the server side. I am not thinking about handling the stream on the client side in any particular way except to save it to a file for storage and later processing, so the client should not require special knowledge for this to work. The streaming would be a server internal issue more or less. Right now our server slices the source files according to the user's WCS request and saves this output file as a temp file. This temp file then gets sent to the user via HTTP either as part of the server response itself (WCS parameter store=false) or as a separate HTTP download initiated by the client (store=true).

Now, what I would like to do is skip this internal temp file that is delivered to the user (at least when store=false). Instead, I would like to start sending the sliced data to the user as soon as it becomes available. I do not think this would require a protocol as elaborate as the one you described. What would be needed (in my opinion) is a way to make sure that a netCDF file can be created in a sequential fashion, without any need to later modify data written earlier. This means all data once written by the netCDF API must remain valid no matter what comes after.

What I understood about the design of the format (at least for netCDF-3; netCDF-4 might look completely different, as it is based on HDF5) is the following: there is a header part that contains all the dimension definitions and all attributes (and probably also the variable names), and after that comes the variable data.

This means that there are some restrictions on the write order of the file: all dimensions and variables would have to be defined correctly right at the beginning, and all attributes would have to be created then as well. After that, the variable data would be added one by one. I can certainly not guarantee that the creation of the temp file follows that order right now. I have no idea whether this is actually possible with the current API, or could be made possible with relatively small changes; it might not be trivial, when I think about it... The client would not see any difference compared to the scenario with the intermediate temp file on the server side, except that the download should start more quickly, since the file does not have to be generated completely before the download begins.
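The write discipline being described can be sketched generically with stdio (this illustrates the constraint only; it is not the netCDF API, and 'out' could equally be a pipe or socket stream):

```c
#include <stdio.h>
#include <string.h>

/* Sequential, stream-friendly write order: the complete header (standing in
 * for the dimension/variable/attribute section) is emitted first and is
 * final; data follow strictly in order; nothing is ever seeked back to and
 * patched. Any output already sent to the client stays valid no matter what
 * comes after. Returns 0 on success, -1 on a short write. */
int stream_dataset(FILE *out, const char *header,
                   const double *data, size_t n) {
    size_t hlen = strlen(header);
    if (fwrite(header, 1, hlen, out) != hlen) return -1;   /* header first, once */
    if (fwrite(data, sizeof *data, n, out) != n) return -1; /* then data, in order */
    return fflush(out) == 0 ? 0 : -1;
}
```

The netCDF-3 difficulty is exactly that the library does not promise this discipline: defining a variable or attribute after data have been written can rewrite the header, which is impossible once bytes have left for the client.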

Re: Re: Re: Streaming netCDF -- Rhusar 08:22, 14 July 2011 (MDT)

Todd, thanks a bunch for the streaming explanations. I am certain that streaming will become a significant issue as we start moving around larger chunks and more diverse data. So I set up a Streaming_and/or_netCDF_File page for this discussion. There we can describe the main features/issues of the streaming and CF-netCDF file approaches. Hope you can help us there.

The essence of my thinking is to identify where and when streaming is desirable/doable and when the file approach is more appropriate. As Martin points out below, the heart of the OGC WCS protocol is space-time subsetting, i.e. slicing and dicing (multi-dimensional subsetting). These are essential 'features' that we cannot do without. AQ data slices are generally modest in size and may not benefit much from streaming (true?). However, slices make the life of the client much easier, particularly if they are strongly typed, self-describing files such as CF-netCDF. Such payload files can also carry additional, well-structured metadata as further help to the user. So would you help us document the benefits and issues of streaming...? You may also just point to something you have written and we can add it to the Streaming_and/or_netCDF_File page.

Good point on "universal interoperability requires a single comprehensive (non-federated) data model".

Re: Streaming netCDF -- RRew 06:25, 15 July 2011 (MDT)

Sorry to jump in so late, but Unidata has already implemented experimental netCDF streaming. I've added John Caron to the CC: list, because he's the architect of the netCDF streaming protocol and APIs.

For more information on what's already done, see these web pages on the service API and the streaming protocol:

http://www.unidata.ucar.edu/software/netcdf-java/stream/CdmRemote.html
http://www.unidata.ucar.edu/software/netcdf-java/stream/NcStream.html

Unidata has also recently implemented client side handling for cdmremote and the ncstream protocol in the C library, although it hasn't yet been extensively tested. It's in the netCDF daily snapshot releases, and we intend to make it available in the next release (4.2, which we're planning for the end of 2011).

Re: Re: Streaming netCDF -- JCaron 06:29, 15 July 2011 (MDT)

I'll just add these comments: we have experimented with streaming netCDF (classic format) files. It is possible, though one essentially needs to build a special library to do so, i.e. you can't use the usual netCDF API, which assumes random access. There is still a problem if you want to use the unlimited dimension: you have to know how many records the file will have before you start writing. If that's the case, then one can do it. I can send you some experimental Java code if you are interested.

We decided that there were too many limitations to make this worthwhile for what is just a performance optimization on the server, so we have decided not to support it. (It might be a useful solution to build the file in memory, if it's small enough.)

ncstream is the direction we are currently exploring. It is essentially a new format for netCDF, but one that can be read by both the C and Java libraries and easily transformed to netCDF-3 or -4. You can read the docs and ask questions if you are interested.