Talk:Streaming and or netCDF File


Streaming netCDF -- Michael Decker (MDecker) 06:40, 14 July 2011 (MDT)

Is it possible to stream netCDF output files to the client on the fly, without the need to save them locally on disk first? This would eliminate a disk I/O bottleneck and also ease possible restrictions on the maximum size of requested datasets. (This would at least be useful for the store=false mode of operation.) In this context we should also explore whether there are any benefits from using OPeNDAP to some extent. Another idea we had was to find some way to reliably estimate (or predict) the total dataset size before delivery (and possibly before assembly of the result on the server). This would be useful on the server side to restrict queries that would be too big, as well as on the user side to give an idea of likely download durations.

Re: Streaming netCDF -- Hoijarvi 06:42, 14 July 2011 (MDT)

Streaming huge files is beneficial, because you can start writing to the output before reading the data has finished. Todd Plessel at EPA uses streaming, I've heard. The Unidata NetCDF library does not support it directly, but if you restrict the writing order (attributes first, then variables in the defined order), it might be possible to add streaming support to the NetCDF library. We should ask the Unidata NetCDF developers. About the I/O bottleneck: modern file caching, especially on Windows when you open the file with the temporary bit on, is quite fast. We definitely need to benchmark the I/O speed first.
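
For what it's worth, a minimal Win32 sketch of that "temporary bit" idea; the wrapper function and scratch-file handling here are mine, for illustration only:

  #include <windows.h>

  /* Open a scratch file with the "temporary" hint so Windows favors
     keeping it in the file cache, and delete it automatically on close.
     (Hypothetical wrapper, not from any of our code.) */
  HANDLE open_scratch(const char *path)
  {
      return CreateFileA(path,
                         GENERIC_READ | GENERIC_WRITE,
                         0, NULL,
                         CREATE_ALWAYS,
                         FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE,
                         NULL);
  }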

Re: Re: Streaming netCDF -- Michael Decker (MDecker) 06:51, 14 July 2011 (MDT)

That sounds interesting, I would love to hear about that. Streaming on-the-fly could improve the server's response times a lot.

"The Unidata NetCDF library does not support it directly, but if you restrict the writing order: attributes first, then variables in the defined order, it might be possible to add streaming support to the netcdf library. We should ask unidata netcdf developers."

Yes, I have been considering something like that, but without explicit support from the low-level netCDF libraries you can never be quite sure whether it will work in all cases. The netcdf4-python module offers a sync/flush call that would probably help in hacking this if there is no reliable official way to do it.
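
For the C library, a rough sketch of that flush-as-you-go idea might look like the following (the file name, dimensions, and variable are made up for illustration; whether a concurrent reader can safely follow the growing file is exactly the unanswered question):

  #include <netcdf.h>

  /* Sketch: write the header first, then append records in defined
     order, syncing after each one so a reader could, in principle,
     follow along. Reliability without official support is unproven. */
  int write_incrementally(const float *records, size_t nrec, size_t reclen)
  {
      int ncid, dimids[2], varid;
      if (nc_create("out.nc", NC_CLOBBER | NC_SHARE, &ncid)) return 1;
      nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
      nc_def_dim(ncid, "points", reclen, &dimids[1]);
      nc_def_var(ncid, "aod", NC_FLOAT, 2, dimids, &varid);
      nc_enddef(ncid);                      /* header is now on disk */

      for (size_t r = 0; r < nrec; ++r) {
          size_t start[2] = { r, 0 }, count[2] = { 1, reclen };
          nc_put_vara_float(ncid, varid, start, count, records + r * reclen);
          nc_sync(ncid);                    /* flush so a reader can see it */
      }
      return nc_close(ncid);
  }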

Re: Re: Streaming netCDF -- MSchulz 06:56, 14 July 2011 (MDT)

Concerning the streaming idea: it would indeed be wonderful to hear from Todd, Ben, or Ethan how it is done, can be done, or should be done.

Re: Re: Streaming netCDF -- Todd Plessel (Todd) 12:56, 14 July 2011 (MDT)

The concept is for data to be streamed over a socket from one process (often on a remote computer) into another process without the need to first write it to a transient disk file (on the local computer).
Such socket I/O (with appropriately configured buffer sizes [Note 1 at end]) is considerably faster than disk writes/reads.
(I don't have my I/O benchmarks handy, but I was astonished to learn a long time ago that, for large files, ftp-localhost-put was faster than cp.) For such streaming to be maximally efficient, it is important that the data format be 'predictively parsable'.
This means that each read is for a pre-determined/expected amount of data, either fixed-size or previously read, with no temporary buffer guessing. For example, suppose the input to some program is a variable-length list of numbers.

A non-predictively-parsable format would rely on some sentinel (EOF, a quote character, </end>, etc.) to terminate the list, e.g., <list> 1.1 2.2 3.3 4.4 5.5 </list>. This is inefficient, since the reading program must guess/allocate/reallocate+copy some buffer as it reads the data (assume an arbitrary-length data line or number of data lines between the delimiters).
Even using convenient language features such as std::vector<double> numbers; while (...) { ...; numbers.push_back( value ); } only hides the inefficiencies (over-allocating, re-allocating and copying, or, with linked lists, the extra memory used for the pointers).
Whereas a predictively-parsable format such as 5 1.1 2.2 3.3 4.4 5.5 is both straightforward and more efficient (and common since FORTRAN days), allowing the program to first read the length of the list (and allocate exactly the needed memory to hold it, once), then read exactly the expected number of values.
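
A minimal C sketch of such a predictive reader (the function name is illustrative):

  #include <stdio.h>
  #include <stdlib.h>

  /* Read a predictively-parsable list: the count comes first, so we can
     allocate exactly once and read exactly that many values.
     Handles input such as: 5 1.1 2.2 3.3 4.4 5.5 */
  double *read_list(FILE *in, size_t *count)
  {
      double *values = NULL;
      if (fscanf(in, "%zu", count) == 1 &&
          (values = malloc(*count * sizeof *values)) != NULL) {
          for (size_t i = 0; i < *count; ++i)
              if (fscanf(in, "%lf", &values[i]) != 1) {
                  free(values);
                  return NULL;              /* malformed input */
              }
      }
      return values;
  }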
An example of a decades-old streamable data format is the ESRI Spatial Database Engine (SDE) C API, which allows Shapefile-format data to be streamed over sockets between processes (possibly on different computers).

Another example is RSIG's 'FORMAT=XDR' (http://badger.epa.gov/rsig/webscripts.html#output), which has a fixed-size (predictively parsable) ASCII header followed by portable XDR (IEEE, big-endian) format binary data arrays.
Here is an example for GASP-AOD satellite data (to pick a non-trivial case):

  GASP 1.0
  http://www.ssd.noaa.gov/PS/FIRE/GASP/gasp.html
  2008-06-21T00:00:00-0000
  # Dimensions: variables timesteps scans:
  3 24 14
  # Variable names:
  longitude latitude aod
  # Variable units:
  deg deg -
  # Domain: <min_lon> <min_lat> <max_lon> <max_lat>
  -76 34 -74 36
  # MSB 64-bit integers (yyyydddhhmm) timestamps[scans] and
  # MSB 64-bit integers points[scans] and
  # IEEE-754 64-bit reals data_1[variables][points_1] ... data_S[variables][points_S]:
  <binary data arrays described above are here...>
Notes:
1. The ASCII header is always 14 lines long.
2. There are always 3 dimensions (line 5).
3. The integers points[scans] (line 13) tell how many data points there are per scan.
4. The reals data_1..S are the data, per variable and scan.
Having read (and summed) the points[], the size of all data arrays is known before reading them.
In general, before reading any variable-length data, enough information has already been read to know how to read the next piece of data. So it is predictively parsable, efficient, human-readable, and straightforward. It is so simple that it does not require any format-specific API to read.
This no-API-required property is an important point for supporting the web-services philosophy of portable, platform/language/technology-independence. (Also, often fewer lines of native I/O code are needed to read the above format than a comparable file (below) utilizing the NetCDF API; see the sketch below.)
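
To illustrate the no-API point, here is a rough native-I/O reader for the GASP stream above (the function name is mine; big-endian-to-host byte swapping and all error handling are deliberately elided, and a 64-bit long long is assumed):

  #include <stdio.h>
  #include <stdlib.h>

  /* Sketch: read the 14-line GASP header with native I/O only, then
     the binary arrays whose sizes are by then fully known. */
  void read_gasp(FILE *in)
  {
      char line[256];
      int variables = 0, timesteps = 0, scans = 0;

      for (int i = 1; i <= 14; ++i) {        /* the fixed-size header */
          fgets(line, sizeof line, in);
          if (i == 5) sscanf(line, "%d %d %d", &variables, &timesteps, &scans);
      }

      long long *timestamps = malloc(scans * sizeof *timestamps);
      long long *points     = malloc(scans * sizeof *points);
      fread(timestamps, sizeof *timestamps, scans, in);
      fread(points,     sizeof *points,     scans, in);

      long long total = 0;                   /* sum points[] so the data */
      for (int s = 0; s < scans; ++s)        /* size is known before     */
          total += points[s];                /* reading the data         */

      double *data = malloc(variables * total * sizeof *data);
      fread(data, sizeof *data, variables * total, in);
      /* ... use data, then free(timestamps); free(points); free(data); */
  }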
Examples of non-streamable formats include NetCDF and HDF, since these both require format-specific (and language-specific) API libraries to read the data and, per those APIs, the data must be in files.
E.g., nc_open() takes a file name rather than a FILE*.
(A FILE* is returned from pipes via popen() and can even wrap sockets via fdopen().)
Perhaps this could be remedied in future versions of these libraries.
Here is an ASCII dump of GASP-AOD in NetCDF-COARDS format:

  $ ncdump gasp1.nc
  netcdf gasp1 {
  dimensions:
          points = 802 ;
  variables:
          float longitude(points) ;
                  longitude:units = "degrees_east" ;
          float latitude(points) ;
                  latitude:units = "degrees_north" ;
          float aod(points) ;
                  aod:units = "none" ;
                  aod:missing_value = -9999.f ;
          int yyyyddd(points) ;
                  yyyyddd:units = "date" ;
          int hhmmss(points) ;
                  hhmmss:units = "time" ;
          float time(points) ;
                  time:units = "hours since 2008-06-21 11:15:00.0 -00:00" ;

  // global attributes:
                  :west_bound = -76.f ;
                  :east_bound = -74.f ;
                  :south_bound = 34.f ;
                  :north_bound = 36.f ;
                  :Conventions = "COARDS" ;
                  :history = "http://www.ssd.noaa.gov/PS/FIRE/GASP/gasp.html,XDRConvert" ;
  data:

   longitude = -74.00761, -74.03289, -74.00693, -74.03222, -74.00624, ...
Is the above example data format predictively parsable?
Not unless it is understood that there is only ever one dimension, called points, and there are only ever 6 variables, called longitude, latitude, aod, yyyyddd, hhmmss, and time. In general, the ASCII dump of an arbitrary NetCDF file is not predictively parsable. However, the NetCDF API allows for querying the number of dimensions before reading them, and the number and type of variables before reading them, etc.
So, given reliance on an API, such data can be made predictively parsable and thus efficiently streamable. It is up to Russ Rew and staff to consider the devilish implementation details.
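
For example, a sketch of that query-before-read pattern with the NetCDF C API (the inspect() wrapper is hypothetical):

  #include <netcdf.h>
  #include <stdio.h>

  /* Sketch: the NetCDF API does let a reader discover dimensions and
     variables before reading them, which is what would make API-based
     streaming predictively parsable. Error checks elided. */
  void inspect(const char *path)             /* e.g. "gasp1.nc" */
  {
      int ncid, ndims, nvars, ngatts, unlimdim;
      nc_open(path, NC_NOWRITE, &ncid);
      nc_inq(ncid, &ndims, &nvars, &ngatts, &unlimdim);

      for (int d = 0; d < ndims; ++d) {      /* dimensions first... */
          char name[NC_MAX_NAME + 1];
          size_t len;
          nc_inq_dim(ncid, d, name, &len);
          printf("dimension %s = %zu\n", name, len);
      }
      for (int v = 0; v < nvars; ++v) {      /* ...then variables */
          char name[NC_MAX_NAME + 1];
          nc_type type;
          int vdims, dimids[NC_MAX_VAR_DIMS], natts;
          nc_inq_var(ncid, v, name, &type, &vdims, dimids, &natts);
          printf("variable %s (type %d, %d dims)\n", name, (int)type, vdims);
      }
      nc_close(ncid);
  }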

So to summarize: for large scientific datasets, I prefer simple, predictively-parsable, straightforward, portable, platform/language/technology-independent (no API required), efficient, high-performance, binary streamable data formats. I realize that NetCDF, HDF, etc. are not going away anytime soon, so we (data consumers) must grapple with their complexities and inefficiencies.
But at least, if/when we create new data formats, consider the above qualities. (Years ago, I developed a comprehensive Field Data Model with the above qualities and much more, but never finished it.) Finally, as I've said before, universal interoperability requires a single comprehensive (non-federated) data model, not just use of general-purpose data formats such as NetCDF, HDF, etc.
Conventions such as COARDS and CF are a step in the right direction, but far from a comprehensive data model.

[Note 1] Regarding I/O buffer sizes:

For sockets, these can (and should) be portably set to 256 KB, using setsockopt() with SO_SNDBUF and SO_RCVBUF set to 256 * 1024, which results in faster performance than the puny default sizes.
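
In C, that looks something like this (a sketch; the wrapper function is mine):

  #include <sys/socket.h>   /* POSIX; on Windows use winsock2.h instead */

  /* Enlarge a socket's send/receive buffers to 256 KB, as suggested
     above; "sock" is an already-created socket descriptor. */
  void enlarge_buffers(int sock)
  {
      int size = 256 * 1024;
      setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof size);
      setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, sizeof size);
  }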

File I/O buffer sizes are not as easy to set portably. However, I discovered that NetCDF has its own internal copy buffers, and they can be increased to 256 KB by changing, in nc.c's nc_open(), size_t chunk_size_hint = 256 * 1024;
which yields improved performance, but is still much slower than direct fread/fwrite calls.
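
If recompiling the library is not an option, the same hint appears to be exposed through the nc__open() variant; something like the following sketch, assuming that variant is available in the library version in use:

  #include <netcdf.h>

  /* Sketch: pass the chunk-size hint at open time via nc__open()
     instead of patching nc.c. */
  int open_with_hint(const char *path, int *ncid)
  {
      size_t chunk_size_hint = 256 * 1024;   /* the value suggested above */
      return nc__open(path, NC_NOWRITE, &chunk_size_hint, ncid);
  }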
(I also replaced the swapn4b byte-swapping routine with a pipelined, optionally parallel implementation, which also boosted NetCDF performance.)