GEOSS AIP-II: Air Quality Community Infrastructure
The content of this page came from the paper: Enhancing Data Discovery, Understanding and Usage through an Air Quality Metadata System
E.M. Robinson a*, R.B. Husar a, S.R. Falke a,b, E.T. Habermann c, S. Nativi d, A. Warnock e, M. Hogeweg f, J. Lieberman g
a Center for Air Pollution Impact and Trend Analysis, Washington University, St. Louis, MO 63105, USA – (emr1, rhusar, stefan)@wustl.edu b Northrop Grumman, St. Louis, MO 63101 c National Oceanic and Atmospheric Administration, Boulder, CO – [email protected] d University of Florence, Prato, Italy - [email protected] e A/WWW Enterprises, Crownsville, MD 21032 – [email protected] f ESRI, Redlands, CA - [email protected] g Transverse Technologies, Need location - [email protected]
Air quality data are normally created by a data provider for a particular mandated end user. The data format is specific to that use, and the metadata that describe the data are focused on a single application. Within GEOSS, however, there is growing recognition that ‘a single problem requires many datasets and a single dataset can serve many applications’ (Zhao, 2006).
For a dataset to be accessed, providers need to be able to publish it, and users need to be able to find it and bind to it. Standardized data access services allow interoperability in binding to a dataset. However, the first two steps, publishing and finding a data access service, are still problematic. In particular: (1) How does one describe the data access service so that it can be discovered by users? (2) Where does one publish the data access service so that it can be found? This paper describes the process that was taken to enhance the publishing and finding of Earth observation data access services. While the paper currently focuses on air quality datasets, the only things specific to air quality are the test datasets we have used. The method could be applied to data access services in any other societal benefit area within GEOSS.
2. COLLABORATION METHOD
The metadata records were developed and published through the GEOSS Common Infrastructure (GCI) collaboratively, within the Architecture Implementation Pilot-II (AIP-II) of GEOSS. The pilot work was organized along two dimensions (Fig. 1): societal benefit area workgroups and transverse technology workgroups.
The societal benefit work groups used their domain to test end-to-end flow of data and metadata through the publish-find-bind process facilitated by GCI. The transverse workgroups worked on particular technologies and tools specific to publishing, finding or binding (AIP-II, 2008).
The main transverse workgroup responsible for the publishing and finding of data access service metadata was the Catalog, Clearinghouse, Registries and Metadata workgroup (CCRM WG, 2009). This group worked closely with the air quality workgroup (AQ WG) to create standard metadata records that could be published through a web accessible folder, harvested and exposed by the clearinghouses to the portals, and found there by AQ users (Fig. 1). The workgroups communicated mostly through weekly telecons, an active listserv and the Google Sites workspace for Pilot activities. Currently, three clearinghouses are under development, through USGS, ESRI and Compusult. A Google Docs spreadsheet was created collaboratively to relate the metadata record elements to the queryable fields in each clearinghouse; it can be viewed in the supplemental material.
For more information on the collaboration and the pilot, see the poster given at AGU.
3. ISO 19115 METADATA RECORDS
One of the goals of the AIP-II was to identify a GEOSS record that could be common to all Earth observations. The AQ WG worked with the CCRM WG to identify the key elements needed for discovery and to create a standard metadata record that could be used by the community. The AQ WG chose the ISO 19115 geographic information metadata standard for these records because it is open, internationally recognized, and the geospatial metadata standard recognized by GEOSS (2005) in the GEOSS Standards Registry (GEOSS, 2008). ISO 19115 is also compatible with the Catalog Service for the Web (CSW) standard interface, which has an ISO 19115 profile for dynamic querying (ISO 19115, 2005). The CSW and the ISO 19115 Profile for CSW are also GEOSS-recognized catalog service standards. Another benefit of ISO 19115 is that it covers not only discovery metadata but also data lineage, usage and other types of metadata needed to understand a dataset across multiple applications. The initial focus of metadata creation was on discovery; future work will extend the metadata to usage, lineage and other components. The discovery metadata record is defined as the minimum information a user needs to find the data and bind to it.
To identify the fields necessary for discovery, we started with the ISO 19115 Core Metadata fields (Nogueras-Iso, 2005) and added the fields needed for a Catalog Service for the Web (CSW) Record. This fulfilled two goals: creating valid ISO 19115 records, and being able to retrieve the records from the clearinghouse through a CSW 2.0.2 query. The resulting list was compared to the queryable and returnable fields of the current GEOSS Clearinghouses. None of the clearinghouses, nor the GEOSS Service Registry, allowed all of the CSW:record fields to be searched. In addition, the three clearinghouses queried over different fields in the metadata record, there was no standard nomenclature for field names, and fields that are key for Earth observation data, such as the temporal extent of the datasets, were missing.
Through the collaboration between the AQ WG and the CCRM WG, a key set of queryable fields for all Earth observations has emerged as a potential GEOSS Record: dataset title, abstract, geolocation, temporal extent, keywords, service type, metadata file ID and associated component (catalog information). This set allows for text search as well as spatial and temporal searches. If a common vocabulary is used, keyword search additionally allows discovery by measurement platform or observed phenomenon. The service type allows filtering by a specific service standard, and the associated component links the metadata record of a given dataset to a larger community catalog. These queryable fields can be expressed in any metadata format; what matters is that the community agrees this set is needed to find any Earth observation service. With the GEOSS Record plus the ISO 19115 Core Metadata fields, the AQ WG had a metadata record that could be used for discovery.
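As an illustration of what retrieving records "through a CSW 2.0.2 query" can look like, the sketch below assembles a GetRecords request in key-value-pair form. The endpoint URL and the search term are hypothetical, and individual clearinghouses may support only a subset of these parameters or queryables, so this should be read as a minimal example rather than the query actually used in the Pilot.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_getrecords_url(endpoint, cql_constraint, max_records=10):
    """Build a CSW 2.0.2 GetRecords request URL (KVP encoding)."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "resultType": "results",
        "elementSetName": "full",
        "constraintLanguage": "CQL_TEXT",
        "constraint_language_version": "1.1.0",
        "constraint": cql_constraint,
        "maxRecords": str(max_records),
    }
    return endpoint + "?" + urlencode(params)


# Hypothetical clearinghouse endpoint and full-text search term.
url = build_getrecords_url(
    "https://clearinghouse.example.org/csw",
    "AnyText LIKE '%ozone%'",
)

# Decode the query string again to confirm the parameters survive encoding.
query = parse_qs(urlparse(url).query)
print(query["request"], query["constraint"])
```

Whether a given clearinghouse honors a constraint depends on which queryable fields it exposes, which is the gap the field comparison above was meant to document.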
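The proposed set of queryable fields can be pictured as a small data structure together with the three kinds of search it enables (text, spatial, temporal) plus filtering by service type. The following is a minimal sketch, not a GEOSS-defined schema: the class name, attribute names and the sample record are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DiscoveryRecord:
    """Hypothetical container for the proposed GEOSS Record fields."""
    file_id: str                 # metadata file ID
    title: str
    abstract: str
    bbox: tuple                  # geolocation: (west, south, east, north) in degrees
    time_start: date             # temporal extent
    time_end: date
    keywords: list = field(default_factory=list)
    service_type: str = ""       # e.g. "OGC:WMS"
    parent_catalog: str = ""     # associated component (community catalog)


def bbox_overlaps(a, b):
    """True if two (west, south, east, north) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def matches(rec, text=None, bbox=None, when=None, service_type=None):
    """Apply whichever of the text, spatial, temporal and type filters are given."""
    if text is not None:
        haystack = " ".join([rec.title, rec.abstract, *rec.keywords]).lower()
        if text.lower() not in haystack:
            return False
    if bbox is not None and not bbox_overlaps(rec.bbox, bbox):
        return False
    if when is not None and not (rec.time_start <= when <= rec.time_end):
        return False
    if service_type is not None and rec.service_type.lower() != service_type.lower():
        return False
    return True


# Invented sample record, for illustration only.
rec = DiscoveryRecord(
    file_id="example-o3-wms",
    title="Example Surface Ozone",
    abstract="Hypothetical hourly surface ozone dataset.",
    bbox=(-125, 24, -66, 50),
    time_start=date(2008, 1, 1),
    time_end=date(2008, 12, 31),
    keywords=["ozone", "surface"],
    service_type="OGC:WMS",
    parent_catalog="example-aq-community-catalog",
)

print(matches(rec, text="ozone", when=date(2008, 7, 1), service_type="OGC:WMS"))
```

A record that lacks any one of these fields simply cannot be reached by the corresponding search, which is why a missing temporal extent matters for Earth observation data.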
4. AIR QUALITY COMMUNITY CATALOG
The community catalog is a component contributed by the Earth Science Information Partners and is registered in the GEOSS Component and Service Registry as both a component and a service. The GEOSS AIP group considered several options for implementing the community catalog: Catalog Service for the Web (CSW), Z39.50 and Web Accessible Folder (WAF). The WAF is the simplest option, since the only criterion is that a folder be accessible on a server, and the WAF is “agnostic” to the metadata format used. The Renewable Energy workgroup in the AIP-II and NOAA both used the WAF method for publishing metadata, and both were successfully harvested by the ESRI Clearinghouse early in the Pilot development. Following these groups' work through the AIP Plenary telecons and the Google Sites Pilot workspaces allowed the AQ community to learn faster from others, and we ultimately implemented the WAF for our initial community catalog as well.
One drawback of the WAF is that it is not a standard service interface. Additional steps were therefore taken to register the WAF as a special arrangement in the GEOSS Standards Registry. Testing the GEOSS architecture through the AIP-II identified the need for the WAF, and the special arrangement was based on a group agreement within the Pilot.
Another drawback of the WAF relative to the CSW or Z39.50 interfaces is that records in the WAF do not have reliable date stamps to identify changes in content. The only way to ensure that the most recent content is harvested is therefore to re-harvest the entire WAF every time. CSW and Z39.50 records include a modification date, so clearinghouses need to re-harvest only those records whose modification date is later than the last harvest.
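The difference between the two harvesting strategies can be sketched in a few lines. This is an illustrative model only: the function name and the representation of a record as an (identifier, modification date) pair are assumptions, and real clearinghouse harvesters are not described in this paper.

```python
from datetime import datetime


def select_for_harvest(records, last_harvest):
    """Choose which records a clearinghouse must fetch again.

    records: list of (record_id, modified) pairs, where modified is a
    datetime, or None when no reliable date stamp exists (the WAF case).
    """
    # Without trustworthy date stamps, changes cannot be detected,
    # so everything has to be harvested again.
    if any(modified is None for _, modified in records):
        return [record_id for record_id, _ in records]
    # With modification dates (the CSW / Z39.50 case), only records
    # changed since the last harvest are needed.
    return [record_id for record_id, modified in records if modified > last_harvest]


last_harvest = datetime(2009, 2, 1)

# Hypothetical catalog contents.
csw_records = [("a", datetime(2009, 1, 1)), ("b", datetime(2009, 3, 1))]
waf_records = [("a", None), ("b", None)]

print(select_for_harvest(csw_records, last_harvest))
print(select_for_harvest(waf_records, last_harvest))
```

The cost of the WAF approach therefore grows with the size of the whole folder rather than with the number of changed records.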
5. PUBLISH-FIND-BIND FLOW THROUGH GCI
Publish: Initially it was thought that “publishing” data on the web in any format was enough for sharing. It then became obvious that integrating datasets calls for standard data access formats, so publication came to mean exposing a standard data access service on the web. Through the Pilot it became apparent that exposing the data access service with a GetCapabilities document alone does not provide enough metadata for discovery; additional standard metadata needs to be published by the provider or distributor so that the service can be discovered. The AQ metadata record and the AQ Community Catalog were created for this purpose.
The services we initially registered for the Pilot are OGC WMS and WCS services. Each has a GetCapabilities document that includes some metadata, which can be mapped to ISO 19115 fields (Nativi, 2008). To further improve the use of GetCapabilities documents for metadata creation, we organized our data access services by dataset, so that the general metadata information is dataset specific and each coverage corresponds to one dataset parameter. Only a handful of additional fields are needed to create a valid metadata record, and these can be hard coded or entered at the time of registration.
Figure 2 shows the flow of metadata. Starting with the service provider, the GetCapabilities document is used to create an ISO 19115 metadata record for the data access service. The XML document is saved into the community catalog. The community catalog is registered as a component in the GEOSS Component and Service Registry (CSR). The GEOSS Clearinghouses query the GEOSS CSR for catalogs and then harvest the catalogs for their metadata records, which completes the metadata publishing process.
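The idea of deriving metadata fields from a GetCapabilities document can be sketched as follows. The sample document is an invented, heavily trimmed WMS 1.1.1 response, and the output is a plain dictionary with illustrative key names rather than a valid ISO 19115 XML record; the actual crosswalk in Nativi (2008) is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Invented, minimal WMS 1.1.1 capabilities document for one dataset.
SAMPLE_CAPABILITIES = """<WMT_MS_Capabilities version="1.1.1">
  <Service>
    <Name>OGC:WMS</Name>
    <Title>Example Surface Ozone</Title>
    <Abstract>Hypothetical hourly surface ozone dataset.</Abstract>
    <KeywordList><Keyword>ozone</Keyword><Keyword>air quality</Keyword></KeywordList>
  </Service>
  <Capability>
    <Layer>
      <Title>Example Surface Ozone</Title>
      <LatLonBoundingBox minx="-125" miny="24" maxx="-66" maxy="50"/>
      <Layer><Name>o3</Name><Title>Ozone</Title></Layer>
    </Layer>
  </Capability>
</WMT_MS_Capabilities>"""


def capabilities_to_fields(xml_text, capabilities_url):
    """Extract discovery fields from a WMS 1.1.1 capabilities document."""
    root = ET.fromstring(xml_text)
    service = root.find("Service")
    box = root.find("Capability/Layer/LatLonBoundingBox")
    return {
        # Dataset-level information taken from the Service section.
        "title": service.findtext("Title"),
        "abstract": service.findtext("Abstract"),
        "keywords": [k.text for k in service.findall("KeywordList/Keyword")],
        "service_type": service.findtext("Name"),
        # Geographic extent from the top-level layer.
        "west": float(box.get("minx")),
        "south": float(box.get("miny")),
        "east": float(box.get("maxx")),
        "north": float(box.get("maxy")),
        # One nested layer per dataset parameter.
        "parameters": [l.findtext("Name") for l in root.findall("Capability/Layer/Layer")],
        # Not present in the document itself: supplied at registration time.
        "capabilities_url": capabilities_url,
    }


fields = capabilities_to_fields(
    SAMPLE_CAPABILITIES,
    "https://data.example.org/wms?service=WMS&request=GetCapabilities",
)
print(fields["title"], fields["parameters"])
```

Fields that the capabilities document cannot supply, such as the temporal extent or the parent catalog, are the ones that would be hard coded or entered at registration.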
Figure 2. Publish-Find-Bind process for data access services from service provider to service user
Find: To find the data access services in the clearinghouse, one has to know what to search for. The clearinghouse queryable fields and the metadata fields were mapped in a crosswalk to clarify how information is extracted from the harvested metadata records. The key queries we have been interested in so far are finding our own records through the parent identifier, and searching for services by type and keyword.
The ESRI and USGS clearinghouses expose a search API, which has enabled the AQ group to create more customized search interfaces. One example interface that we have set up uses the USGS API and full-text search to find WMS or WCS services (Ref link).
Bind: Once the user has found the dataset of interest, the next step is to bind to the dataset (Fig. 2) and display the data access service through the tool of choice. Each service is described in the metadata record with its GetCapabilities URL. In preliminary testing with WMS services, the Compusult Clearinghouse was able to bind to the GetCapabilities URL found in the metadata record and display a WMS instance of the map.
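To make the bind step concrete, the sketch below derives a WMS 1.1.1 GetMap request from the GetCapabilities URL stored in a metadata record. The server URL, layer name and bounding box are hypothetical, and a real client would first read the capabilities document to learn which layers, formats and coordinate systems the service offers; this is not a description of how the Compusult Clearinghouse performs binding.

```python
from urllib.parse import urlsplit, urlunsplit, urlencode, parse_qs


def getmap_url(capabilities_url, layer, bbox, width=512, height=256):
    """Derive a WMS 1.1.1 GetMap URL from a GetCapabilities URL."""
    parts = urlsplit(capabilities_url)
    # Keep any server-specific parameters, but drop the ones being replaced.
    kept = {
        key: values[0]
        for key, values in parse_qs(parts.query).items()
        if key.lower() not in {"service", "request", "version"}
    }
    kept.update({
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": str(width),
        "HEIGHT": str(height),
        "FORMAT": "image/png",
    })
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Hypothetical GetCapabilities URL as it might appear in a metadata record.
url = getmap_url(
    "https://data.example.org/wms?service=WMS&request=GetCapabilities&version=1.1.1",
    layer="o3",
    bbox=(-125, 24, -66, 50),
)

# Decode the query string to inspect the resulting request parameters.
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(query["REQUEST"], query["LAYERS"], query["BBOX"])
```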
6. RESULTS AND NEXT STEPS
Through the community efforts of AIP-II, the Air Quality Community has had initial, though limited, success in establishing the flow of metadata and data through the GCI. Data access services can be published to the WAF as ISO 19115 metadata records. Once the WAF is registered as a component and service in the GEOSS CSR, the WAF is harvested and the data access services are exposed through the clearinghouses.
Interactions between the clearinghouses and the WAF are still changing. There are currently no formal procedures for scheduling harvests other than e-mail requests and informal arrangements. There are also open issues with the deletion of records in the WAF and how deletions propagate to the clearinghouses. Finally, even though the metadata contain the fields needed for discovery, the clearinghouses do not yet search over all of these fields, which makes records difficult to find.
This work is in progress and still evolving. Developing publication and discovery methods collaboratively within the GEOSS AIP has proved to be a worthwhile approach and will be continued to advance the work to date. The discovery metadata may continue to change as others add their content and the fields need to be revised. Additionally, discovery is only one aspect of the data life cycle; the next steps will be to further close the loop between provider and user through more flexible DataSpaces (Robinson et al., 2008). DataSpaces will incorporate the structured ISO 19115 metadata described above and will further extend the metadata to incorporate user feedback, discussion and free tags, and to harvest flexible community-contributed content such as papers and web applications related to the dataset.
Architecture Implementation Pilot-II, 2008, http://www.ogcnetwork.net/AIpilot
Air Quality Workgroup Workspace, 2009, https://sites.google.com/site/geosspilot2/air-quality-and-health-working-group
Clearinghouse, Catalog, Registry and Metadata Workgroup Workspace, 2009, https://sites.google.com/site/geosspilot2/Home/clearinghouse-catalogue-registry-metadata
GEOSS 10 Year Plan Reference Document, 2005, http://www.earthobservations.org/documents/10-Year%20Plan%20Reference%20Document.pdf
GEOSS Standards Registry, 2008, http://www.earthobservations.org/gci_sr.shtml
McCabe et al., 2009. The GEO Air Quality Community of Practice: a Call to Participate. ISRE, Stresa, May 2009.
Nativi, S., 2008. “Mapping WCS 1.1 GetCapabilities and DescribeCoverage into ISO Metadata – 19115 and 19119”.
Nogueras-Iso, J., Zarazaga-Soria, F.J. & Muro-Medrano, P.R. Geographic information metadata for spatial data infrastructures: resources, interoperability and information retrieval. (2005).
OpenGIS® Catalogue Services Specification 2.0 - ISO19115/ISO19119 Application Profile for CSW 2.0, 2005 http://portal.opengeospatial.org/modules/admin/license_agreement.php?suppressHeaders=0&access_license_id=3&target=http://portal.opengeospatial.org/files/index.php?artifact_id=8305
Robinson et al., 2008. “Dataspaces: Using Community Workspaces to Enable Rich Air Quality Metadata” presented at Fall AGU.
Zhao, D., 2006. “GEOSS Overview and Progress” presented at the WMO TECO-WIS, Korea, 6-8 Nov 2006.