Candidate Technical Topics

From Earth Science Information Partners (ESIP)
 

Latest revision as of 08:55, August 30, 2011


==Air Quality Data Network (ADN), Non-IT Issues==

[[AQ Network From Virtual to Real|Virtual to Real]]

Purpose and scope of ADN | AQ network stakeholders | Relationship to integrating initiatives | Governance, legitimacy, impediments

===What is the purpose of ADN?===

* Facilitate access to air quality information (measurements, model results, analyses) in order to enable users (stakeholders) to make better-informed choices.
* Boost scientific understanding of air chemistry and pollution transport processes by enabling synthesis views based on multi-platform observations and model ensembles.
* Reduce chaos by improving the standardization and documentation of existing data sets and by exposing data processing and quality control procedures.
* Ensure reproducibility of scientific analyses by allowing for traceability of data sets and definition of responsibilities within the data chain.

===Who is involved in the ADN and what are their roles?===

* Data providers: mandated data origin vs. scientific data sets; what are the implications of a real ADN for them?
* Data federators/network hubs: who does it and why? Who has a mandate? What are the respective roles and focal points?
* Users: Who are they? What is the interaction with them? Who should become a user but doesn't know about it yet?

===What organizations are stakeholders in the network? How do they relate to ADN?===

===What are the ADN governance, legitimacy, and impediments?===

===What are the minimum requirements for ADN?===

What few things must be the same, so everything else can be different.

* Autonomy - Interoperability balance...
* Who formulates the requirements?
* When is an ADN an ADN?

===What is the scope of AQN?===

Geographic coverage, variables, ambient fixed-station observations, satellite observations, emissions, models?

===What data (processing) level is served through ADN?===

Is ADN a decider? Why not the provider? Raw data, derived data?

==Data Servers: Technical Realization (IT) Issues and Solutions==

netCDF, CF, WCS standards and conventions | Implementation for gridded and station data | Development tools | Server performance

===Issues re. the use of netCDF and other data formats===

netCDF is the standard format for multi-dimensional data. CF-netCDF is used both as an archival format for gridded data and as a payload format for WCS queries.

* Issue: ambiguity and completeness of CF
** ''development of a server-independent (Python) CF-API library''
*** some (beta) code available at FZJ: [http://repositories.icg.kfa-juelich.de/hg/CommonUtils/file/faad03a63f98/CommonUtils/cf_netcdf.py CommonUtils.cf_netcdf], feel free to suggest a better name
** Brainstorming: What is missing in CF?
** Issue: the CF (udunits) time format is not the same as the ISO time format (as used by WCS)
*** the two cases can be processed with different code, but uniformity would be less confusing
*** could try to get an ISO time recommendation into CF; would still need different code because it would only be a recommendation
** Issue: geo-referencing
*** CF offers support for projections (see [http://cf-pcmdi.llnl.gov/documents/cf-conventions/1.5/cf-conventions.html#grid-mappings-and-projections here]), but they are not used by the WCS server so far
*** the CF-Metadata list recently discussed the handling and specification of projections, see thread http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2011/007935.html
* Issue: We should define a standard netCDF Python interface (PyNIO, python-netcdf4, scipy.io.netcdf?)
* Issue: other output formats
** support fused into the server, or an add-on concept (possibly using the public W*S/netCDF interface)
** delivery of (small) data sets in ASCII/CSV format?
* Issue: reading other gridded input data formats? (e.g. GRIB)
* Issue: traceability and revision tracking of datasets (in WCS metadata as well as in netCDF metadata)
* Info: CF standard name grammar: [http://www.met.reading.ac.uk/~jonathan/CF_metadata/14.1/ CF metadata grammar concept]
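The udunits-vs-ISO time mismatch noted above can be illustrated in a few lines. The following is a minimal sketch, not AComServ code: it assumes the common <code>&lt;unit&gt; since &lt;timestamp&gt;</code> form of CF time units and handles nothing else (no alternative calendars, no time zones).

```python
from datetime import datetime, timedelta

def cf_time_to_iso(value, units):
    """Convert a CF (udunits-style) time value, e.g.
    value=36.0 with units="hours since 2011-08-30 00:00:00",
    into the ISO 8601 string that a WCS TimeSequence expects.
    Only the plain "<unit> since <timestamp>" pattern is handled.
    """
    step, since, epoch = units.split(" ", 2)
    assert since == "since", "unsupported units string"
    origin = datetime.strptime(epoch, "%Y-%m-%d %H:%M:%S")
    seconds_per = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}
    t = origin + timedelta(seconds=value * seconds_per[step])
    return t.strftime("%Y-%m-%dT%H:%M:%SZ")

print(cf_time_to_iso(36.0, "hours since 2011-08-30 00:00:00"))
# 2011-08-31T12:00:00Z
```

A real implementation would live in the CF-API library discussed above and would also have to cope with the non-standard calendars CF allows.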

===Server co-development tools, methods===

Server code is maintained through SourceForge (bug tracker, tarballs); Darcs code repositories are available at WUSTL and in Juelich.

* Issue: version control
** maintain a common codebase
* Issue: documentation
** mainly inline tech documentation to date
** need more documentation regarding operation
** proposal: [http://sphinx.pocoo.org/ sphinx] for proper documentation (easy to include inline tech doc)

===Use of WMS, WCS, WFS ... in combination?===

Data display/preview is through WMS. AQ data can be delivered through WCS or WFS. In AComServ, WCS is used for transferring n-dimensional grid and point-station data, and WFS for delivering monitoring station descriptions.

* Issue: WMS interface for preview; "latest" token for dynamic links?
** generic WMS service operating on an external WCS
** a "latest" token could be realized on the WCS-WMS interface by using the metadata on the WMS/client side and requesting the latest time; the latest time could be the default response of the WMS server if nothing else is requested

===Gridded data service through WCS===

WCS is implemented in multiple versions: 1.0, 1.1.2, 2.0. The AQ Community Server (AComServ) is now implemented using WCS 1.1.2. This generally works well.

* Issue: serve "virtual" WCS datasets with a continuous time line assembled from many source files
** clients should only have to do one query to receive the whole time series in one piece, instead of requiring client-side logic to request multiple pieces
** could create a "wrapper" module that handles such cases with knowledge of the server-side file structure
*** Kari has already done something like this for HTAP datasets; this could be a starting point
* Issue: desirable time filtering options in WCS: hour of day, day of week, day of month, etc.
* Issue: extraction of vertical levels
** already defined through the rangesubset/fieldsubset parameter
** is this definition OK for us, or would we need something else/better?
*** potential problem: only enumeration of levels is possible, no ranges
* Issue: current state of WCS 2.0? Core released, but extensions still in draft (how do we keep track of what is currently valid?)
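The "wrapper" module could start from a mapping of the requested time range onto the source files that cover it. A minimal sketch, assuming a hypothetical one-netCDF-file-per-month naming scheme (the pattern and file names are made up):

```python
from datetime import date

def files_for_range(start, end, pattern="aq_{y:04d}-{m:02d}.nc"):
    """List the monthly source files covering [start, end].
    A wrapper serving a 'virtual' continuous coverage would open
    these in order and concatenate along the time axis, so the
    client needs only one WCS query for the whole series.
    """
    files = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        files.append(pattern.format(y=y, m=m))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return files

print(files_for_range(date(2010, 11, 1), date(2011, 2, 28)))
# ['aq_2010-11.nc', 'aq_2010-12.nc', 'aq_2011-01.nc', 'aq_2011-02.nc']
```

The real server-side structure may of course differ (daily files, irregular gaps), which is exactly the knowledge such a wrapper would encapsulate.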

===Delivery of station-point data===

* Issue: use WCS or WFS, or a combination of both? (see also [[WCS_Server_Software#WCS_Server_for_Station-Point_Data_Type]])

===Access rights===

* Issue: technical options to restrict access to datasets?

===Data server performance issues/solutions?===

Define performance issues, measurements, ideas.

* Issue: especially big datasets take a long time to prepare for delivery (slicing/subsetting, etc.)
** direct streaming of datasets to the client could be part of the solution; [[Streaming_and_or_netCDF_File|click here]] for details
** generated datasets could be cached for a while, so they could be delivered again when a request with compatible parameters arrives
** problem: the two proposals might be mutually exclusive to some degree
* Issue: management overhead when opening netCDF
** when opening a netCDF file, some metadata has to be read and data structures have to be set up
*** input files could be kept open for a while to avoid this overhead
* Issue: temp file space is limited on the WCS server
** a streaming approach for the store=false parameter would not require additional local storage
** the temp-file approach for the store=true parameter could be limited by a maximum dataset size
*** requires a reliable output file size estimator
*** the server would return an exception if the estimated size is over a given threshold
*** would force people to use store=false for large datasets
*** should not violate the WCS 1.1 standard (too badly), as only store=false is mandatory
* Issue: XML metadata assembly might take a long time depending on the catalog content, e.g. with a lot of Identifiers
** the GetCapabilities response metadata is very static anyway; other responses (DescribeCoverage) could be cached for a while
*** attention: the DescribeCoverage response depends on parameters
** minor issue compared to actual data delivery performance
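The "output file size estimator" for store=true requests could be as simple as multiplying the requested grid shape by the sample size and adding some header overhead. A sketch under those assumptions; the 500 MB threshold, the 4-byte sample size and the overhead figure are made-up illustrative values:

```python
def estimate_output_bytes(shape, dtype_bytes=4, header_overhead=64 * 1024):
    """Rough upper bound for a netCDF result file: number of grid
    cells times bytes per sample, plus a flat metadata allowance."""
    cells = 1
    for n in shape:
        cells *= n
    return cells * dtype_bytes + header_overhead

MAX_BYTES = 500 * 1024 ** 2  # hypothetical temp-space threshold (500 MB)

def check_store_request(shape):
    """Reject a store=true request whose estimated output would
    exceed the temp-space budget, forcing store=false instead."""
    if estimate_output_bytes(shape) > MAX_BYTES:
        raise ValueError("estimated output too large; please use store=false")
```

For example, a year of hourly global 1-degree data, shape (8760, 180, 360), estimates to roughly 2.3 GB at 4 bytes per sample and would be rejected, while a single day of the same grid passes.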

===Relationship to non-AComServ (non-netCDF) WCS servers===

* Issue: data format(s)
** most WCS clients don't understand netCDF
* Issue: protocol compatibility
** might need to implement more optional features of WCS
* Issue: standard compliance
** will need a test suite for 1.1.2 (and manage to run it)

===Linkages to non-WCS servers===

* Issue: is there a need?
* Issue: which protocols? (OPeNDAP? GIS servers?)

==Data Network: Technical Realization (IT) Issues and Solutions==

Network components, servers, clients, catalog | Mediated access | Catalog implementation, integration | Network level data flow, statistics, performance

===What is the design philosophy?===

Service-oriented (everything is a service); component and network design for change; open source (everything?!)

===Functionality of an Air Quality Data Network Catalog (ADNC)?===

===Content and structure (granularity) of ADNC?===

===Interoperability of ADNC===

This is a key question for achieving the goal of transforming the ADNC from virtual to real. There are a number of systems out there which provide services either in real time or from archived data. Connecting these services, so that data from all of the different sources can be made available through a single interface (note: this doesn't mean one implementation of this interface!), presents various technical and non-technical challenges. Different protocol versions, different OGC services, different metadata descriptions and different data formats upon delivery need to be recognized, and some harmonisation must be achieved here.

Of the existing services, the Community WCS server hubs (Datafed and Juelich) and the NASA/DLR ACP are probably the most advanced in terms of implementing data services through the OGC WCS standard. Yet it remains to be demonstrated that these services can be connected in the fully interoperable, loosely coupled sense.

Interoperability with whom? What standards are needed? CF naming extensions?
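The "single interface, many implementations" point above can be made concrete with a small adapter sketch. Everything here is hypothetical (class names, coverage identifiers, return values); it only illustrates how heterogeneous hubs could be merged behind one catalog view without sharing an implementation:

```python
from typing import Protocol

class CoverageSource(Protocol):
    """The one network-facing interface every hub would implement,
    regardless of its internal technology (hypothetical sketch)."""
    def list_coverages(self) -> list: ...
    def get_coverage(self, identifier: str, time_iso: str) -> bytes: ...

class Wcs112Source:
    """Stand-in adapter for a WCS 1.1.2 hub (e.g. an AComServ
    instance); a real one would issue GetCapabilities/GetCoverage."""
    def list_coverages(self):
        return ["pm25_surface"]       # would be parsed from GetCapabilities
    def get_coverage(self, identifier, time_iso):
        return b"netCDF payload"      # would be the GetCoverage response

def catalog(sources):
    """Merge coverage listings from heterogeneous hubs into one
    lookup table: coverage identifier -> responsible source."""
    return {cid: s for s in sources for cid in s.list_coverages()}
```

A second adapter class speaking, say, OPeNDAP could be added without touching the catalog function; that separation is the loose coupling the paragraph above asks for.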

===Access rights and access management===

===What are the generic (ISO, GEOSS, INSPIRE) and the AQ-specific discovery metadata?===

===Minimal metadata for data provenance, quality, access constraints?===

===Single AQ catalog? Distributed? Service-oriented?===

===Network-level data flow, usage statistics (Google Analytics), performance===