NASA ACCESS09: Tools and Methods for Finding and Accessing Air Quality Data

From Earth Science Information Partners (ESIP)
Revision as of 12:01, June 29, 2009 by Erinmr

uFIND: User-oriented Tool Set for Air Quality Data Discovery and Access

The objective of this work is to connect air quality users to air quality-relevant data services at NASA and elsewhere through the implementation of a set of well-tested and effective services, tools and methods. The centerpiece of the proposed project is the web service-based tool set: user-oriented Filtering and Identification of Networked Data (uFIND). The purpose of uFIND is to provide rich and powerful facilities for the user to discover and choose a desired dataset by navigating the multi-dimensional metadata space using faceted search, to seamlessly access and browse datasets, and to use uFIND's facilities as a web service for mashups with other AQ applications and portals. Datasets found through uFIND will be accessible through the standard OGC WCS and WMS data access protocols. A major perceived contribution of the proposed work is an infrastructure for the harvesting, harmonization and continuous augmentation of metadata through the metadata flow system. The metadata will follow the ISO 19115 standard for geospatial metadata. The Service Oriented Architecture of uFIND will allow service-based interfacing with providers and users of the metadata. uFIND will be applicable to other Earth Science domains and will be instrumental in the emerging Global Earth Observation System of Systems (GEOSS), in particular in the development of the GEOSS Common Infrastructure (GCI).

Scientific/Technical/Management Section


This proposal is in response to the solicitation: Advancing Collaborative Connections for Earth System Science (ACCESS) 2009. In particular, it focuses on providing "means for users to discover and use services being made available by NASA, other Federal agencies, academia, the private sector, and others". The objective of this work is to implement a set of well-tested and effective services, tools and methods for the discovery and seamless access to air quality-related datasets.


Recent developments offer outstanding opportunities to fulfill the information needs of the Earth Sciences and to support many societal benefit areas. The satellite sensing revolution of the 1990s now yields near-real-time observations of many atmospheric parameters. The data from surface-based monitoring networks now routinely provide detailed characterization of atmospheric and surface parameters. The "terabytes" of data from these surface and remote sensors can now be stored, processed and delivered in near-real time, and the instantaneous "horizontal" diffusion of information via the Internet now permits, in principle, the delivery of the right information to the right people at the right place and time. Nevertheless, atmospheric scientists and air quality decision makers face significant hurdles. The production of atmospheric observations and models is rapidly outpacing the rate at which these observations are assimilated and metabolized into actionable knowledge that can produce scientific understanding and societal benefits. The "data deluge" problem is especially acute for atmospheric scientists interested in the use of satellite observations. As a consequence, most atmospheric observations are significantly under-utilized.
In the U.S., virtually all the important Earth Science datasets, including air quality-relevant observations, are now publicly accessible through the Internet. Typically, data providers place the numeric datasets onto a web server along with the associated 'Readme' files, which contain descriptive information about the data, access instructions and other metadata. Over the past decades, there was also a steady evolution of data directories where individual datasets could be registered, labeled and searched by potential users. The outstanding example of a long-term cataloging effort is NASA's Global Change Master Directory (GCMD), which also illustrates the data deluge problem. A search of GCMD for "atmosphere" shows about 8000 entries for datasets and 1000 entries for services. Even an air quality parameter such as "aerosol" returns about 1000 datasets and 500 services. The directories of other agencies and nations probably contain at least this many entries.

In GCMD, user support for finding a dataset is provided primarily through a controlled vocabulary of keywords along with free text search of the DIF-formatted standard metadata. Providers of Earth Observations can easily publish the data in GCMD and they only have to do it once. On the other hand, finding, accessing and using any given dataset takes a much larger effort, and that effort is repeated individually by every user of the dataset. Collectively, the users of a dataset may expend 100 or 1000 times more effort than its publisher. As pointed out by Cleveland (1982), in an information-rich environment most of the burden of accessing the proper information is carried by the user, not the producer. According to Taylor (1986), this type of information system is classified as a provider (content) driven information system, distinctly different from a user-driven information system.

The centerpiece of the proposed project is the web service-based tool set: user-oriented Filtering and Identification of Networked Data (uFIND). The purpose of uFIND is to provide rich and powerful facilities for the user to discover and choose a desired dataset. These facilities include navigation through the multi-dimensional metadata space using faceted search (uFind Pilot, 2009), the ability to seamlessly access and browse datasets, and the use of uFIND's facilities as a web service for mashups with other AQ applications and portals.

Expected Significance

The outcome of the proposed work will significantly reduce the burden on the user in finding and accessing data relevant to the understanding and management of air quality and atmospheric composition. Air quality is the scientific and research domain of the proposing team and the proposed IT technologies and infrastructure will be directly applicable to the furthering of research for this domain. Air quality is also a significant application area of satellite observations. The tools, methods and infrastructure will also be applicable to other domains of Earth Sciences.

The AQ data user benefits from uFIND because it becomes easier to find suitable data and to access and utilize the data in creating either scientific or actionable knowledge. The providers of AQ data will also benefit, since their data products will be re-used more easily, enhancing the relevance and importance of their products. Finally, the earth science community at large will benefit from the tools and methods for improved data discovery and access, which are applicable to all earth science domains. The broad applicability of this proposed work will also be instrumental in the emerging Global Earth Observation System of Systems (GEOSS), in particular in the development of the GEOSS Common Infrastructure (GCI).

uFIND Use Case

The utility of the AQ uFIND is illustrated below through a use case applicable to air quality management. On May 12, 2007 an intense wildland fire broke out at the Okefenokee National Park in southern Georgia. The intense smoke from the fire engulfed southern Georgia and drifted toward Florida. By May 24, the smoke pall was also drifting northward along the Mississippi Valley, impacting much of the eastern U.S. The smoke increased the ambient aerosol concentration well over the ambient air quality standard (35 µg/m3), causing exceedances. The air quality agencies in the smoke-impacted states were required to document that the exceedances were due to the Exceptional Event (EE) from the Okefenokee fires. In order to provide the required documentation, an air quality analyst of the impacted state would use uFIND (Fig. 1).

Figure 1. Pilot Air Quality uFIND.

In the first step, the analyst would define the spatial and temporal domain of the area of smoke impact: Lat 15 to 45, Lon -90 to -65, 2007-05-27. Under the search facet 'Domain', clicking 'Fire' would reveal the fire observations for that geographic area and time range. The selection shows that the dataset NOAA_HMS_WFS, from satellite platforms, contains multiple fire-related parameters arising from a variety of sensors (AVHRR, GOES, MODIS). After a few minutes of browsing each parameter through the Data Viewer integrated in uFIND, the analyst chooses the parameter HMS_All to represent the fire locations. Next, the analyst chooses the 'Aerosol' domain to explore the various aerosol datasets and parameters, which yields hundreds of parameter records from numerous datasets. In order to confine the search, the analyst chooses 'Satellite' from the 'Platform' facet and 'Day' from the 'Time Resolution' facet, which still yields dozens of satellite-derived parameters. Since her interest is in the geographic location of smoke, she chooses the parameter AOD (Aerosol Optical Depth), which occurs in two datasets, MODIS4_AOT and Seaw_US. In order to assure that the satellite aerosol signal is due to light-absorbing smoke, she also examines the parameter 'ABS_AER', which is a smoke indicator in the OMI_AI_G dataset. Each of the selected datasets is accumulated as a layer in the Data Viewer.

As a next step, the analyst examines the magnitude of the smoke impact on the surface concentration of aerosols. Clicking 'Network' on the Platform facet, 'Hour' on the Time Resolution facet and 'Surface' on the Vertical facet returns all the hourly measured parameters on the surface, which can be found in three datasets: Airnow (EPA's real-time surface monitoring network), AQS_H (EPA's additional hourly data) and Surf_Met (National Weather Service meteorological observations). For the smoke impact analysis, the analyst chooses PM2.5 from Airnow and AQS_H and RHBext from Surf_Met. These surface observations are also added to the stack of data layers in the Data Viewer. Having the location of the fires identified by the satellite-derived fire pixels, the smoke dispersion pattern from the satellite AOD and confirmation from the absorbing aerosol index, she has sufficient documentation to satisfy the first requirement of the Exceptional Event Rule: evidence that the smoke event occurred and that the smoke was observed in impacted areas.
Subsequently, the examination of the surface observations reveals whether the satellite-observed smoke had a corresponding observed impact on the surface concentration. Through additional use of the available chemical composition data, she can also document that the excess aerosol mass concentration during the event can be attributed to smoke organic compounds.
The scenario described above is realistic and was taken from an actual Exceptional Event case that was analyzed for EPA (Evidence for Flagging Exceptional Events, 2008). The original analyses required weeks of effort by experts to find, access and analyze the data. With the aid of the new uFIND system, the same activity can be performed in a matter of hours, and the resulting report can be more thorough because of the broader data pool available. We anticipate that similar time savings and overall productivity improvements can be achieved in other areas of air quality science, management and policy. A particularly suitable application area is the international policy development regarding the Hemispheric Transport of Air Pollutants (HTAP), where satellite observations are crucial for documentation of intercontinental transport.

Technical Approach and Methodology

Information systems are practice driven. There are no natural laws to follow; the field is in a pre-scientific age. Major input to the information system must come from an analysis of the information use environment that a) establishes the information flow into, within and out of an entity; and b) determines the criteria by which the value of information is judged (Taylor, 1986).

There are major impediments to seamless and effective data usage encountered by both data providers and users. The impediments from the user's point of view are succinctly stated in the report by Research, P.O.I.T.A.T.C.O (1998). In short: the user cannot find the data; if she can find them, she cannot access them; if she can access them, she does not know how good they are; if she finds the data good, she cannot merge them with other data. The data providers face a similar set of hurdles: the provider cannot find the users; if she can find users, she does not know how to seamlessly deliver the data; if she can deliver, she does not know how to make the data more valuable to the users. This project intends to provide support to overcome the first two user hurdles: finding and accessing the air quality data.

Architecture and Technologies Used

The uFIND tool set follows a Service Oriented Architecture (SOA) (Fig. 2).

SOA Metadata.png

Fig 2. Service Oriented Architecture and ISO 19115 Metadata Schematic
The critical aspect of SOA is the loose coupling between service providers and service users. Loose coupling is accomplished through plug-and-play connectivity facilitated by standards-based data access service protocols. SOA is the only architecture that we are aware of that allows both loose, dynamic connection and seamless flow of data between a rich set of provider resources and a diverse array of users. In SOA, the service provider registers services in a suitable service registry; the users discover the desired service, retrieve an access key and seamlessly access the data. The result is a dynamic binding mechanism for the construction of loosely-coupled work-flow applications.
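
The register, discover and bind sequence can be illustrated with a minimal in-memory sketch. The class `ServiceRegistry`, its method names, the record fields and the `example.org` endpoints are all hypothetical and are not part of uFIND or of any OGC registry; the sketch only shows how a user can bind to a service at run time without prior knowledge of the provider.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRecord:
    """One registered data service (all fields are illustrative)."""
    name: str
    protocol: str   # e.g. "WMS" or "WCS"
    endpoint: str   # the "access key": the URL a client binds to
    keywords: set = field(default_factory=set)


class ServiceRegistry:
    """Hypothetical in-memory stand-in for a real service registry."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        # Provider side: publish the service description once.
        self._records[record.name] = record

    def discover(self, keyword, protocol=None):
        # User side: find services by keyword and, optionally, protocol.
        return [
            r for r in self._records.values()
            if keyword in r.keywords
            and (protocol is None or r.protocol == protocol)
        ]


registry = ServiceRegistry()
registry.register(ServiceRecord(
    "fire_pixels", "WMS", "http://example.org/wms", {"fire", "satellite"}))
registry.register(ServiceRecord(
    "aod_grid", "WCS", "http://example.org/wcs", {"aerosol", "satellite"}))

# The user discovers a service and retrieves its access key (endpoint).
matches = registry.discover("aerosol", protocol="WCS")
endpoint = matches[0].endpoint
```

Because the user only depends on the registry interface and the protocol name, a provider can change or add endpoints without any change on the user side.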

Service orientation has been accepted as the desired way of delivering Earth Observation data products. However, the adoption of formal standards-based data-as-service offerings has been slow within NASA and other agencies. Offering images through the OGC WMS standard interface is becoming common for many Federal agency data products, but there is currently no effective way for the users to find those services since they are dispersed over many web pages.

This project will rely on mature and widely used standard protocols for interoperability:
(1) OGC WMS, WCS and WFS web services for accessing data;
(2) netCDF-CF as the standard data format for gridded and point monitoring data and models;
(3) ISO 19115 for geospatial metadata;
(4) RSS/Atom and HTTP for inter-service message transfer (Fig. 4).

Furthermore, the uFIND software system will be implemented using three key maturing 'intellectual technologies': (1) Tagging for flexible, user-extensible structuring and annotation of diverse metadata; (2) Faceted search technology for navigation through multidimensional data discovery and (3) Ajax-based dynamic user interface to the data discovery and spatio-temporal data browsing and exploration.

The individual technologies described above can be considered to be at a technology readiness level (TRL) of at least 7. They have been developed and implemented in near-operational environments. Many are used routinely in web data and information exchange among scientific and public domain communities. The work described in this proposal is targeted at combining those technologies and techniques in order to provide a user-driven uFIND.

Engineering and Implementation

The three major components of the proposed SOA-based uFIND software are: (1) facilities to identify AQ-relevant data services; (2) facilities for publication of data as a service; and (3) facilities for AQ users to find and access data services. In this design, each component is a service, which allows the reuse of any part of the system in mashups with other web-based applications. Each of these major components will be supported by a set of tools and methods.

User-Oriented Filtering of Relevant Datasets

The AQ user needs will determine what datasets are initially targeted for inclusion in uFIND. The identified AQ Earth Observation priorities from the GEO Task US-09-01A (GEO Task US-09-01a Home Page, 2009) as well as other published, user-filtered dataset lists like the one prepared by Scheffe for EPA will guide the human and machine surveying of catalogs like GCMD, the GEOSS Clearinghouse, GeoSpatial OneStop, etc. in order to identify potential datasets. The identification of AQ-relevant datasets will be performed by the proposed team. However, it will expand to be a communal process. Initially, the proposing team, with the ESIP AQ Workgroup will work with AQ data providers and AQ users in order to populate uFIND with relevant AQ data. As the critical mass of participation is reached within the uFIND network it will allow the larger community of participants to act as a filter for what is valuable to the AQ Community.

Another batch of selected datasets will come from several "data hubs" that already have a subset of datasets relevant to particular AQ analysts. Some examples are EPA AQS for State Analysts, VIEWS for RPOs, NILU EMAP and NASA Giovanni, as well as DataFed, developed by the proposing team. A full listing of current air quality data systems was compiled during EPA's Data Summit in 2008 (Summit, 2008), which also contains many user recommendations. In uFIND these datasets will be harmonized, so that the data being offered through any of these hubs can be reused by multiple applications.

It is hoped that uFIND will also attract potential AQ data providers from the AQ community to participate through the incentive of reaching more data users as well as the immediate reward that when the dataset is accessed through standard formats it immediately can be used with many viewers and other client applications. Participation in uFIND will also allow providers to review web analytic information and connect with air quality users of their data, receiving both direct and indirect feedback about their data products.

Data Access

The first condition of uFIND is providing data through a standard interface. Datasets found through AQ uFIND will be accessible through the standard OGC WCS and WMS data access protocols. This allows users to seamlessly access the data subset of interest. By adding standard interfaces, datasets can be converted into other user formats such as netCDF and KML, which are accessible by data browsers for exploration.
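
As an illustration of what access through a standard interface means in practice, the sketch below assembles a WMS 1.1.1 GetMap request for the spatial and temporal domain of the use case (Lat 15 to 45, Lon -90 to -65, 2007-05-27). The endpoint `http://example.org/wms`, the layer name and the image size are assumptions made for the example; only the parameter names come from the WMS 1.1.1 standard.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def wms_getmap_url(endpoint, layer, bbox, time,
                   width=500, height=600, fmt="image/png"):
    """Build a WMS 1.1.1 GetMap URL.

    bbox is (min_lon, min_lat, max_lon, max_lat) in EPSG:4326.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",                      # empty string = default style
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": fmt,
        "TIME": time,
    }
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint and layer name; domain taken from the use case.
url = wms_getmap_url("http://example.org/wms", "HMS_All",
                     bbox=(-90, 15, -65, 45), time="2007-05-27")

# Round-trip check: decode the query string back into its parameters.
query = parse_qs(urlparse(url).query)
```

Any WMS-capable viewer can consume such a URL, which is why a dataset registered with a standard interface becomes usable in many client applications at once.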

Data as Service

OGC WCS is particularly applicable for representing space-time-varying phenomena in the Fluid Earth Sciences, atmosphere and oceans. OGC WCS version 1.1 is limited to grids, or "simple" coverages, with homogeneous range sets, but future revisions of the standard are anticipated to include support for a broader set of coverages, including point coverages.

WCS DataAccess.png

Figure 3. (a) WCS Communication Protocol; (b) WCS Data Wrappers

Attractive features of these services are that (1) they can be executed using the simple, universal HTTP GET/POST Internet protocol; and (2) the services are described by formal XML documents ("GetCapabilities", "DescribeCoverage"), and the access instructions and output formats can be advertised in those service documents (Fig. 3a).
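
A short sketch of this request pattern follows. It builds GetCapabilities and DescribeCoverage URLs using the key-value parameters of WCS 1.1 and reads coverage identifiers out of a capabilities document. The endpoint is hypothetical, and `SAMPLE_CAPABILITIES` is a heavily simplified stand-in written for this example, not a complete or valid WCS response.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def wcs_request_url(endpoint, request, **extra):
    """Build a WCS 1.1 key-value (HTTP GET) request URL."""
    params = {"service": "WCS", "version": "1.1.0", "request": request}
    params.update(extra)
    return endpoint + "?" + urlencode(params)


# Simplified stand-in for a capabilities response (illustrative only).
SAMPLE_CAPABILITIES = """<Capabilities xmlns="http://www.opengis.net/wcs/1.1">
  <Contents>
    <CoverageSummary><Identifier>AOD</Identifier></CoverageSummary>
    <CoverageSummary><Identifier>ABS_AER</Identifier></CoverageSummary>
  </Contents>
</Capabilities>"""


def coverage_identifiers(capabilities_xml):
    """Return the text of every Identifier element, in document order."""
    root = ET.fromstring(capabilities_xml)
    # Match on the local name so the sketch does not depend on the
    # exact namespace URI used by a particular server.
    return [el.text for el in root.iter() if el.tag.endswith("}Identifier")]


caps_url = wcs_request_url("http://example.org/wcs", "GetCapabilities")
names = coverage_identifiers(SAMPLE_CAPABILITIES)
describe_url = wcs_request_url("http://example.org/wcs", "DescribeCoverage",
                               identifiers=names[0])
```

In a live client the capabilities text would be fetched from `caps_url` over HTTP rather than taken from a constant.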

Wrappers and Tools

Most air quality datasets reside on servers in individual files or in SQL servers. These datasets need to go through a 'wrapping process' (Fig. 3b). Wrappers are reusable interfaces that turn data in files into data as service. Wrapper classes exist for: Sequential Images and Files, SQL server databases and for netCDF data files.

A particularly important wrapper class is for data files formatted according to the netCDF-CF convention. Special effort will be invested to create a portable software template for accessing netCDF-formatted data using the WCS protocol. It is hoped that this tool will promote the access and ease of use of distributed data.

The netCDF-CF file format is a common way of storing and transferring gridded meteorological and air quality model results. The CF convention for structuring and naming of netCDF-formatted data further enhances the semantics of the netCDF files. The netCDF-CF data format is most useful for the exchange of multidimensional gridded model data. It was also demonstrated that the netCDF format is well suited for the encoding and transfer of station monitoring data. Traditionally, satellite data were encoded and transferred using the HDF format. The new netCDF version 4 (beta) library provides a common API for the netCDF and HDF-5 data formats. The netCDF-CF data format is supported by a robust set of well-documented and maintained low-level libraries for creating, maintaining and accessing data in that format on multiple platforms (Linux, Windows). The low-level libraries provided by UNIDATA also offer a clear application programming interface (API).

The WCS wrapper for netCDF software has three functions: (1) accessing the contents of netCDF-CF files over the HTTP GET Internet protocol; (2) imposing a standard data query language using the WCS standard; and (3) allowing easy (non-intrusive) adaptation to evolving standards. The main components of the wrapper software are shown schematically in the accompanying figure. At the lowest level are open source libraries for accessing netCDF and XML files. At the next level are Python scripts for extracting spatial subset slices for specific parameters and times. At the third level is the WCS interpreter that parses the WCS URL. The Capabilities and Description files are created automatically from the netCDF files, but the provider can supply a template containing information about the organization, contacts and other metadata. The WCS netCDF-CF wrapper is a communal activity initiated by our group and pursued by collaborators from the US, New Zealand, Germany and elsewhere. It is also a participatory 'project' within the ESIP Air Quality Workgroup. It is anticipated that the wrapper will find wide application in this project.
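
The layering described above can be sketched in a few lines. This is not the wrapper's actual code: a small in-memory grid stands in for a netCDF-CF variable so that the example runs without a netCDF library, the function names are hypothetical, and the `bbox` query parameter is a simplified form rather than the exact WCS 1.1 BoundingBox syntax.

```python
from urllib.parse import urlparse, parse_qs

# Stand-in for a netCDF-CF variable on a 1-degree grid, indexed [lat][lon].
LATS = [30, 31, 32, 33]
LONS = [-85, -84, -83]
GRID = [[10 * i + j for j in range(len(LONS))] for i in range(len(LATS))]


def parse_wcs_query(url):
    """Top level: interpret the query string of a GetCoverage-style URL."""
    pairs = parse_qs(urlparse(url).query)
    q = {key.lower(): values[0] for key, values in pairs.items()}
    min_lon, min_lat, max_lon, max_lat = (float(x) for x in q["bbox"].split(","))
    return q["identifier"], (min_lon, min_lat, max_lon, max_lat)


def subset(grid, lats, lons, bbox):
    """Middle level: keep rows and columns whose coordinates lie in bbox."""
    min_lon, min_lat, max_lon, max_lat = bbox
    rows = [i for i, lat in enumerate(lats) if min_lat <= lat <= max_lat]
    cols = [j for j, lon in enumerate(lons) if min_lon <= lon <= max_lon]
    return [[grid[i][j] for j in cols] for i in rows]


url = ("http://example.org/wcs?service=WCS&request=GetCoverage"
       "&identifier=AOD&bbox=-84,31,-83,32")
name, bbox = parse_wcs_query(url)
result = subset(GRID, LATS, LONS, bbox)
```

Keeping the URL interpreter separate from the subsetting step is what allows the query syntax to follow an evolving standard without touching the data access code.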

AQ uFIND Metadata

The primary purpose of the metadata is to facilitate finding and accessing the data, in order to help users deal with the first two hurdles that they face. In current catalogs, metadata is supplied by the provider or distributor of the data. This metadata includes intrinsic discovery metadata such as spatial and temporal extent, keywords and contact information for the provider. The metadata also includes distribution information for data access. Additionally, providers include various other information to help users once they are at their site. This approach is provider-driven and does not incorporate the user needs.

In a user-centric information system the user experience is improved by metadata contributed along the entire line of data usage from the providers to the users. Furthermore, the datasets need to have additional AQ-relevant metadata added to metadata that the provider gives in order for the AQ user to easily find the data. This additional metadata allows for sharp queries to be given in the parameter space, time, and physical space. Another feature of the user-centric system is that using web analytics additional metadata is attached to each dataset in order to provide information about dataset usage characteristics.

The uFIND system incorporates the structured metadata along the data usage chain using ISO 19115, the Metadata for Geospatial Data Standard. ISO 19115 is ideal because its structure accounts not only for traditional data access and discovery metadata, but also for usage, lineage and other metadata needed for understanding the data. The uFIND record also includes pointers to additional metadata resources, such as a pointer back to the original metadata record from the data provider and a pointer to the associated DataSpace where the metadata record can be viewed. ISO 19115 is also registered in the GEOSS Standards Registry, which allows uFIND to be harvested by the GEOSS Clearinghouse and found through other portals.

Metadata Flow Architecture

The major perceived contribution of the proposed work is an infrastructure for the harvesting, harmonization and continuous augmentation of metadata through the metadata flow system tailored to the use in air quality applications. A design of the metadata system is shown schematically in Figure 4.

Figure 4. Schematics of the uFIND metadata flow and components

The description below follows the component schematic from left to right. Note that the schematic above represents an actual functioning prototype that has been implemented for the GEOSS Architecture Implementation Pilot-II (GEO, 2009). However, it is anticipated that both the individual components and the overall functionality of the metadata flow infrastructure will continue to evolve during this project.

Provider-Contributed Metadata

The components to the left indicate the providers that are anticipated to supply metadata. Each supplier is connected to uFIND through two processes. The first step is to filter available metadata resources and select the records that are relevant to air quality. The second step is to transform the diverse source metadata to the uniform AQ metadata records adopted by uFIND (ISO 19115). It should be recalled that a pre-condition for registering an air quality-relevant dataset in uFIND is that the data are accessible through an OGC standard data access protocol.

Metadata Tools

This proposed work will provide tools, developed in collaboration with EU INSPIRE (INSPIRE, 2009) and others, to create ISO 19115 metadata records either through the transformation of another metadata standard to ISO or by expanding the OGC GetCapabilities document. The tools simplify the process of generating metadata through a semi-automatic process. Metadata formats such as FGDC and DIF have crosswalks to ISO 19115. Using the published crosswalks, the metadata will be re-mapped into ISO 19115 through a style sheet transformation.
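
The idea of a crosswalk can be shown with a minimal sketch. Note the substitution: the proposal describes a style sheet transformation, whereas the example below uses a plain Python dictionary in place of a style sheet so that it stays self-contained. The three field pairs are chosen for illustration and are not taken from a published crosswalk; the sample record and the function name are likewise hypothetical.

```python
# Illustrative crosswalk: source (DIF-style) field -> target (ISO 19115-style)
# field. A real crosswalk covers many more elements and nested structures.
DIF_TO_ISO = {
    "Entry_ID": "fileIdentifier",
    "Entry_Title": "title",
    "Summary": "abstract",
}


def apply_crosswalk(record, crosswalk):
    """Re-map the fields of one flat record.

    Returns the mapped record and the list of source fields for which
    the crosswalk has no entry.
    """
    mapped, unmapped = {}, []
    for key, value in record.items():
        if key in crosswalk:
            mapped[crosswalk[key]] = value
        else:
            unmapped.append(key)
    return mapped, unmapped


# Hypothetical source record.
dif_record = {
    "Entry_ID": "EX_AOD_01",
    "Entry_Title": "Example aerosol optical depth",
    "Summary": "Daily gridded AOD.",
    "Local_Note": "internal",
}
iso_record, leftovers = apply_crosswalk(dif_record, DIF_TO_ISO)
```

Reporting the unmapped fields separately makes the process semi-automatic: the fields the crosswalk cannot handle are surfaced for manual attention rather than silently dropped.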

The proposing team has developed a metadata tool that maps GetCapabilities metadata elements to the corresponding ISO 19115 metadata elements and provides a user interface to manually complete the remaining ISO 19115 metadata elements for which the GetCapabilities document does not contain information. The provider is also able to modify the metadata elements that were extracted from the GetCapabilities document and then save the record as an ISO 19115 compliant file. Each metadata record will incorporate additional AQ-specific metadata needed for the AQ uFIND as well as content acquired during the downstream usage. Prior to acceptance in uFIND, each ISO metadata record is also validated through a web service ISO 19115 validator.

Data Finding

Records are browsed using a customized, faceted search interface that was built to search the extended AQ metadata record set and find AQ data using specific filters such as sampling platform and data structure. Finding the right data is further enhanced by the user's ability to immediately view data as WMS through multiple clients provided by ESRI, Compusult and others. uFIND also allows query results to be embedded in another web page and also to link to multiple WMS viewers to browse layers in the catalog.

In uFIND, the AQ user can search for datasets through a faceted search which reacts to each step of the user's query. This dynamic interface ensures that the user can navigate by the more familiar facets first, and make decisions about less familiar facets as the search is narrowed. The results returned will carry additional metadata to aid decision-making, such as the number of users that have used a dataset or links to places where it is used, so that the AQ user is provided with some additional context.
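
A minimal sketch of such a reacting faceted search is given below. The dataset names are those of the use case, but the facet values assigned to each record (for example the domain of Surf_Met or the time resolution of NOAA_HMS_WFS) are assumptions made for the example, and the function names are hypothetical.

```python
from collections import Counter

# Illustrative metadata records; facet values are assumed for the example.
RECORDS = [
    {"dataset": "NOAA_HMS_WFS", "domain": "Fire",
     "platform": "Satellite", "time_resolution": "Day"},
    {"dataset": "MODIS4_AOT", "domain": "Aerosol",
     "platform": "Satellite", "time_resolution": "Day"},
    {"dataset": "OMI_AI_G", "domain": "Aerosol",
     "platform": "Satellite", "time_resolution": "Day"},
    {"dataset": "Airnow", "domain": "Aerosol",
     "platform": "Network", "time_resolution": "Hour"},
    {"dataset": "Surf_Met", "domain": "Meteorology",
     "platform": "Network", "time_resolution": "Hour"},
]


def filter_records(records, selections):
    """Keep the records that match every selected facet value."""
    return [r for r in records
            if all(r.get(facet) == value for facet, value in selections.items())]


def facet_counts(records, facets):
    """Count remaining values per facet, to show next to each choice."""
    return {facet: Counter(r[facet] for r in records) for facet in facets}


# Step 1: the user picks a familiar facet first.
selections = {"domain": "Aerosol"}
step1 = filter_records(RECORDS, selections)
counts1 = facet_counts(step1, ["platform", "time_resolution"])

# Step 2: the interface has reacted with counts; the user narrows further.
selections["platform"] = "Satellite"
step2 = [r["dataset"] for r in filter_records(RECORDS, selections)]
```

Recomputing the counts after every selection is what lets the interface show only the facet values that still lead to results.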

Given that each AQ dataset is equipped with the standard data access interface, uFIND includes a Data Viewer. The user can browse data layers, compare them and further explore the data, ultimately choosing the most appropriate data for their application.

The AQ uFIND will be equipped with Google Analytics, which tracks multiple dimensions of data query usage as metrics. The analytics information will be exposed to the users as well and information derived from the metrics such as most popular queries will aid users in how to start their queries. The monitoring of query popularity is possible because each query has a unique URL that has a certain number of page views. The Google Analytics will also identify how most users access a dataset, i.e. through Google Earth, WMS, WCS, netCDF, etc. This feedback will help to hone the catalog to provide more useful information to users such as more datasets in a certain domain or tips for what others who viewed this dataset also viewed. The metrics will also help providers by identifying who some of the users of the data are and how they are finding a data product.
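
The query popularity metric can be sketched as a count of page views per query URL. The example below does not call Google Analytics; it works on a hypothetical list of viewed URLs. Sorting the query parameters so that equivalent queries share one key is an assumption added here and is not described in the proposal.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qsl, urlencode


def canonical_query(url):
    """Sort the query parameters so equivalent queries share one key."""
    return urlencode(sorted(parse_qsl(urlparse(url).query)))


# Hypothetical page-view log: one entry per view of a query URL.
page_views = [
    "http://example.org/ufind?domain=Aerosol&platform=Satellite",
    "http://example.org/ufind?platform=Satellite&domain=Aerosol",
    "http://example.org/ufind?domain=Fire",
]

popularity = Counter(canonical_query(u) for u in page_views)
top_query, top_views = popularity.most_common(1)[0]
```

The most viewed queries found this way could be offered to new users as starting points.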


Currently, metadata for air quality datasets is variable, distributed and normally created by the provider for the user. However, a single dataset can be used for many applications that the provider may or may not anticipate and the data may go through many value-adding processes before it reaches the "end user". Additional metadata can be created at any step along the usage chain and at this time there is no mechanism for collecting this metadata. Consequently, users don't know how a dataset has been used or what additional processing has occurred beyond the originator. One method to harvest and share metadata from all members of the usage chain is through community workspaces, DataSpaces. DataSpaces are virtual spaces for contributing and archiving metadata, discussing the dataset and harvesting distributed resources in order to capture the critical community knowledge about the dataset.

A DataSpace (Robinson, 2008) for a given dataset has two parts: structured, semantically rich metadata and flexible community-contributed metadata. The structured dataset description includes standard dataset metadata, data lineage, and data quality information such as provider, parameters, platform and time period. The additional value of the DataSpaces comes from the context provided by the dataset community: users, mediators and providers. This may be through links to other mediator or user-provided metadata, publications that reference the dataset, or web applications and tools using the dataset. DataSpaces also provide a place where a dataset community can connect through discussion and announcements about the dataset. As DataSpaces evolve and are used more by the community, additional functionality will emerge. Currently, there are still many issues with the implementation of DataSpaces, including how to link the DataSpace to the dataset as it moves along the usage chain and how material in DataSpaces can be reused in other metadata.

Connections to other Activities

uFIND is an open system and its functionality depends on inputs from AQ users, data providers as well as value-adding contributors. Most connections of uFIND system can be executed through formal service interfaces. This open architecture will be the key mechanism for the scalable growth and evolution of uFIND System.

For example, within ESIP we anticipate symbiotic interaction with the Semantic Web Cluster. uFIND, with its thousands of data layers, can provide a rich resource and be a use case for ontology development and testing. In return, the developed ontologies and catalog services will help uFIND improve the user experience. Similarly, it is anticipated that collaboration with various ESIP groups pertaining to data provenance will be pursued and the results incorporated into the uFIND metadata. This will include a provenance chaining service.

Collaborations with other activities will build upon previous and ongoing collaborations developed during the project team's earlier projects. Through recent community-building exercises, such as the Federation of Earth Science Information Partners (ESIP) Air Quality Workgroup and the GEOSS Architecture Implementation Pilot, the community of collaborators needed for a successful Air Quality uFIND has already been developed; through this project, it will be focused on particular capabilities and needs.

Participation in the ESDSWGs is expected to include the Technology Infusion Working Group. Rudolf Husar and Stefan Falke will contribute time, totaling 0.25 FTE, to these working groups. During previous NASA REASoN and AIST projects, Stefan Falke and Rudy Husar contributed to the Technology Infusion, Sensor Web and Reuse working groups. Based on that experience, we anticipate that this ACCESS project would contribute most suitably to the Technology Infusion Working Group, but we will work with NASA to scope our participation.

It is worth noting that a similar requirement in an earlier NASA REASoN CAN compelled us to become involved in the Earth Science Information Partners (ESIP) and to create and coordinate the ESIP Air Quality Workgroup, which has been an ongoing collaboration forum for over four years, with particularly active and broad participation over the last two years. We anticipate that this ACCESS proposal will help continue the coordination and growth of the ESIP Air Quality Workgroup. The ESIP Air Quality Workgroup is a crucial partner in the proposed effort. In fact, we initially submitted this proposal's Notice of Intent with ESIP listed as the PI. However, it was determined that having ESIP serve as PI introduced some conflicts with ESIP policies and by-laws, so ESIP is now represented in the project as a coordination and collaboration environment in which to develop, test and refine uFIND.

The AQ uFIND will be registered as a component in the GEOSS Component and Service Registry (CSR). The GEOSS Clearinghouses query the GEOSS CSR for catalogs and then periodically harvest the catalogs for their metadata records. From the point of view of the GEOSS Common Infrastructure, this particular functionality makes uFIND an Air Quality community catalog.

Extensions of Past Work and Impact of this Work

The proposed work is an extension of research efforts conducted since 2001 on the development of the federated data system, DataFed (DataFedwiki; Husar, 2007; Husar, 2006; Husar, 2005). The DataFed development was supported by NSF and EPA as well as through the 5-year NASA REASoN grant, 2004-2009, "Application of ESE DATA and Tools to Particulate Air Quality Management." Over 100 standards-based air pollution datasets mediated through DataFed have been accessed by over a thousand repeat users from throughout the world (Fig. 5).
Figure 5. Location of the top 100 users from Jan 1, 2007 to June 20, 2009. Captured with Google Analytics.

DataFed was also the data access system for the development and subsequent application of EPA's Exceptional Event Rule. The DataFed services have provided the key data source for many interoperability experiments for data processing and other workflow applications conducted through ESIP and GEOSS. Most recently, DataFed was a key contributor as part of the GEOSS Architecture Implementation Pilot (AIP)-II discussed in the Workplan section of this proposal.

Relevance to NASA Programs

In meeting its goal to "Study Earth from space to advance scientific understanding and meet societal needs", the 2006 NASA Strategic Plan emphasizes that "as new types of Earth observations become available, information systems, modeling, and partnerships to enable full use of the data for scientific research and timely decision support will become increasingly important." The proposed Air Quality uFIND is a tool to improve the ability of NASA and its data users to use existing and new data products and to develop partnerships that increase the usefulness of NASA data products in air quality decision support.
The Decadal Survey published by the National Academies of Science outlines objectives for future satellite earth observation missions for NASA and NOAA. It also stresses the importance of adequate information systems in order to make use of those new observations by concluding that "fundamental improvements are needed in existing observation and information systems because they only loosely connect three key elements: (1) the raw observations that produce information; (2) the analyses, forecasts, and models that provide timely and coherent syntheses of otherwise disparate information; and (3) the decision processes that use those analyses and forecasts to produce actions with direct societal benefits." The proposed project helps couple these three elements more closely by making it easier for earth observations users who create analyses and forecasts to find and access earth observations, thereby improving their ability to support decision processes.

The Air Quality uFIND is a demonstration of a modern, service-oriented architecture and the application of a small set of interoperability technologies that allows the creation of distributed, but interoperable data systems. While the particular application is for the domain of air pollution and atmospheric composition, the architecture, the technologies as well as the tools and methods are directly applicable to other science domains of interest to NASA such as climate change.

The Air Quality uFIND constitutes a conduit through which datasets registered in GCMD and other data directories can be delivered more directly and easily to the air quality science and management community. Furthermore, the direct, two-way network link between uFIND and NASA data portals such as Giovanni constitutes a convincing demonstration of the viability of service-oriented data networking.

Relevance to NRA Objectives

The uFIND infrastructure and the associated tools and methods constitute a direct contribution to the ACCESS objective: to develop a "means for users to discover and use services being made available by NASA, other Federal agencies, academia, the private sector and others." Assessing the total life-cycle cost of this development is difficult since its development has been pursued for about a decade and the process of service-oriented data sharing will continue the development well beyond this two-year project. However, this is an important phase in the evolution of data networking because the applications have reached levels of TRL 7 or higher. This means that data from distributed providers can now be reliably and persistently found and accessed. In fact, we anticipate that at the end of the project, uFIND will incorporate several thousand air quality data layers that can be queried by sharp filters and immediately incorporated into browsers and processing applications.

The design of uFIND is initially targeted at the AQ-relevant datasets from NASA and the more than one hundred datasets already registered in DataFed as services. However, discussions are in progress for the implementation of a uFIND node as part of EPA's Exchange Network. Conceivably, additional uFIND nodes could be established at NOAA as well as at international agencies in Europe and Asia. These future activities are anticipated to be conducted within the architectural framework of GEOSS, and the uFIND nodes would constitute the components of a distributed Air Quality Community Catalog. Hence the life-cycle cost of the system will be the combined investment in system development and maintenance, integrated over the period of its evolution and over the contributions of its community of participants.

uFIND is not appropriate for all datasets. Its design inherently targets datasets that are expected to have the quality and value such that multiple applications can benefit from their use. Creating extensive metadata and providing a standard data access interface may be inappropriate for datasets of limited applicability for reuse. uFIND is also inappropriate for complicated datasets such as aircraft sampling data. Furthermore, uFIND is not suitable for the delivery of raw sampling calibration datasets.

General Work Plan

The basic workplan for the proposed 2-year project consists of the following activities:
Year 1: Focus on establishing connections among uFIND system components and their interactions, and on increasing connections to data providers
  • Survey of data directories and data hubs for relevant AQ data
  • Develop community process for selecting air quality datasets
  • Continue the testing and improvement of WCS-netCDF data wrapper
  • Facilitate WCS interfaces to selected datasets
  • Continue to develop the metadata mapping tool for easy creation of AQ metadata records
  • Facilitate registration of AQ datasets in uFIND
  • Implement DataSpaces for key datasets

Outcome of year 1: AQ-relevant datasets that can be found through sharp queries of the metadata and accessed through OGC Standard Data Access Services

Year 2: Focus on testing and refining uFIND
  • Users test the data access system; based on their feedback, flaws in the access services are eliminated
  • Users test uFIND for usability; flaws in metadata or connectivity are corrected
  • Connect and test uFIND with DataSpaces for compatibility and utility

Outcome of year 2: a well-tested and robust uFIND consisting of thousands of AQ-relevant data layers

It is anticipated that, throughout this project, collaboration will occur through the GEOSS ADC AIP-III, ESIP and EPA's Exchange Network.

Current State of Application

The initial demonstration of uFIND was prepared as part of the GEOSS Architecture Implementation Pilot (AIP-II) in 2008-2009. During that pilot, the components shown in Figure 4 were functional and connected to the GEOSS Clearinghouse (GEOSS Clearinghouse FGDC, 2009). In fact, uFIND served as the Air Quality Community Catalog that was routinely harvested by the clearinghouse. The metadata for the air quality data records has been developed, such that faceted search on about 500 data records could be executed. During AIP-II the data access services were confined to OGC WMS images. WCS data services from providers other than DataFed were not available.
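
To make the idea of faceted search over metadata records concrete, the following is a minimal sketch, not the uFIND implementation. The record fields, facet names and values are hypothetical; the sketch only shows the two operations a faceted interface typically needs: narrowing the record set by selected facet values and counting the values that remain for the next choice.

```python
from collections import Counter
from typing import Dict, Iterable, List

Record = Dict[str, str]

# Hypothetical metadata records; facet names and values are illustrative.
RECORDS: List[Record] = [
    {"id": "r1", "parameter": "PM2.5", "platform": "surface", "provider": "A"},
    {"id": "r2", "parameter": "PM2.5", "platform": "satellite", "provider": "B"},
    {"id": "r3", "parameter": "ozone", "platform": "surface", "provider": "A"},
    {"id": "r4", "parameter": "AOD", "platform": "satellite", "provider": "B"},
]


def filter_records(records: Iterable[Record], selected: Dict[str, str]) -> List[Record]:
    """Keep only the records that match every selected facet value."""
    return [r for r in records
            if all(r.get(facet) == value for facet, value in selected.items())]


def facet_counts(records: Iterable[Record], facet: str) -> Dict[str, int]:
    """Count how many records carry each value of one facet."""
    return dict(Counter(r[facet] for r in records if facet in r))


# Narrow by one facet, then by two, and list what is left to choose from.
surface = filter_records(RECORDS, {"platform": "surface"})
surface_pm = filter_records(RECORDS, {"platform": "surface", "parameter": "PM2.5"})
remaining_parameters = facet_counts(surface, "parameter")
```

Each additional facet selection can only shrink the result set, which is what lets a user navigate from a large catalog down to a single desired dataset.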

As part of this project, the uFIND pilot implementation will be expanded in a number of ways such that at the end of the project uFIND will be a complete and robust package of services and tools with graphical user interfaces. The specific developments in year 1 will include the connection to data directories and user-driven selection of several thousand data layers offered by distributed providers. A major effort will be invested in helping data providers expose their data through the WCS standard interface. The WCS-netCDF data wrapper will undergo considerable testing and further development of functionality and usability. The metadata preparation tools will be expanded to accommodate additional facets for searching and links to additional metadata such as data provenance. The uFIND search engine will be expanded to cover a broader range of facets and alternative user interfaces. A substantial part of the development will consist of linking uFIND data records to data browsers, such as Google Earth for spatial views and time charts for temporal views. The interfaces of uFIND will also be streamlined to improve the incorporation of uFIND outputs into other client applications such as portals.
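
As an illustration of what access through the WCS standard interface looks like from the client side, the sketch below builds a GetCoverage request URL. It assumes the WCS 1.0.0 key-value-pair encoding; the endpoint URL, coverage name and output format are hypothetical placeholders, since the actual values depend on each provider's server and its advertised capabilities.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def wcs_getcoverage_url(endpoint, coverage, bbox, time, width, height,
                        fmt="NetCDF", crs="EPSG:4326"):
    """Build a WCS 1.0.0 GetCoverage request from key-value pairs.

    The default format name is an assumption; real servers list their
    supported formats in the DescribeCoverage response.
    """
    params = {
        "SERVICE": "WCS",
        "VERSION": "1.0.0",
        "REQUEST": "GetCoverage",
        "COVERAGE": coverage,
        "CRS": crs,
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "TIME": time,
        "WIDTH": str(width),
        "HEIGHT": str(height),
        "FORMAT": fmt,
    }
    # Keep commas and colons readable instead of percent-encoding them.
    return endpoint + "?" + urlencode(params, safe=",:")


# Hypothetical endpoint and coverage name, for illustration only.
url = wcs_getcoverage_url(
    endpoint="http://example.org/wcs",
    coverage="example_pm25",
    bbox=(-125, 25, -65, 50),
    time="2009-06-26",
    width=600,
    height=250,
)
query = parse_qs(urlsplit(url).query)
```

Because the request is a plain URL, the same string can be stored in a metadata record, handed to a browser or passed to a processing application, which is the property that makes standard interfaces useful for linking catalog entries to data.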

Management Approach

The major part of the proposed project will be performed by the core group of investigators listed in the table below. Professor Husar is the Principal Investigator and the architect of the federated data system, DataFed. He has a long-term interest in environmental informatics with particular emphasis on tools for data access and exploration. Professor Falke is the Co-Investigator and an expert in geospatial informatics. He has also been a community leader for the ESIP Air Quality Workgroup and other community activities. Erin Robinson is a doctoral candidate in the School of Engineering and has years of experience in environmental data analysis and in tools and technologies to assist collaborative research. Participation in this project will constitute a significant contribution to her Ph.D. thesis. Kari Hoijarvi is the chief software designer and programmer of the major software systems developed at CAPITA since the 1990s. His deep understanding of Service Oriented Programming and data processing will be key to the development of uFIND. Ed Fialkowski is a software developer with experience in networking, portal development and other applications development.

Name            Organization       Role/Contribution
Rudolf Husar    Washington Univ.   PI, uFIND Architect
Stefan Falke    Washington Univ.   Co-I, ESDSWG, DataSpaces Connection
Erin Robinson   Washington Univ.   Graduate Student, uFIND Metadata
Kari Hoijarvi   Consultant         Finder Developer, WCS Server Developer
Ed Fialkowski   Washington Univ.   Metadata Tools Developer
The proposed project will rely heavily on the community contributions of the ESIP Air Quality Workgroup, the GEOSS Architecture Implementation Pilot Community and other communities such as the group associated with EPA's Air Quality Data Summit.

As in past projects, this work will be performed as an open process with active participation of interested members from these communities. The community support will include design guidance, technology contributions and, most importantly, the establishment of service connections between uFIND and upstream and downstream services.

Data Sharing Plan

The entire basis of the Air Quality uFIND is data sharing, so the work plan outlined above is, in essence, our data sharing plan. However, we highlight certain data sharing aspects of the proposed approach here including those aspects that help other projects and programs enhance their data sharing capabilities. The data reuse will be enhanced through the service oriented architecture. The registered datasets are also directly accessible to air quality specific, work-flow based clients which can perform value-adding data processing and analysis.

Sharing research results with communities is vital to the continued development of uFIND and the infrastructures in which it is used. We will present our results at ESTO technology conferences, AGU meetings and other forums conducive to new interactions with data providers and users. We will gain additional interaction with the earth science community through our continued leadership in the Earth Science Information Partners Federation (ESIP) and through GEO-related activities.

Operations Concept

In contrast to a stand-alone project that relies on its own individual initiatives to achieve persistence or sustainability, the Air Quality uFIND is expected to be closely coupled with other, larger efforts that provide opportunities for continued operation by various domain communities. For example, GEOSS is gradually being developed by assembling and developing components of the overall system and bringing them together. While a long-range operations plan is yet to be defined, the expectation is that successful development will lead to a sustainable infrastructure. The Air Quality uFIND is aimed at providing a critical component for the use of GEOSS by the air quality community; as a result, as the GEOSS operations plan is developed, uFIND will be part of it. A suggested approach for the continued operation of uFIND is through collaborations and resource sharing. Operations would be co-supported by other organizations that adopt it, for example, in the U.S., EPA's Exchange Network.


Air Quality and Health Working Group (GeossPilot2). Available at: [Accessed June 26, 2009].

Air Quality Work Group - Federation of Earth Science Information Partners. Available at: [Accessed June 26, 2009].

Cleveland, Harland, Information as a Resource, Futurist, 16, 34-39, 1982.

Data Summit Workspace - Federation of Earth Science Information Partners. Available at: [Accessed June 26, 2009].

Evidence for Flagging Exceptional Events - Federation of Earth Science Information Partners. Available at: [Accessed June 26, 2009].

GEO Task US-09-01a Home Page. Available at: [Accessed June 26, 2009].

GEO User Requirements for Air Quality - Federation of Earth Science Information Partners. Available at: [Accessed June 26, 2009].

GEOSS-CLEARINGHOUSE: Common Search Facility. Available at: [Accessed June 26, 2009].

Husar, R., Falke, S. & Hoijarvi, K., 2006. Interoperability of Web Service-Based Data Access and Processing: Experience Using the DataFed System. ESTO Meeting, 2006. Paper A6P2.

Husar, R. & Poirot, R., 2005. DataFed and FASTNET: Tools for agile air quality analysis. EM (Air and Waste Management Association, Pittsburgh), 39.

Husar, R.B. & Hoijarvi, K., 2007. DataFed: Mediated web services for distributed air quality data access and processing. In IEEE International Geoscience and Remote Sensing Symposium, 2007. IGARSS 2007. pp. 4016-4020.

Husar, R.B. et al., 2008. DataFed: An Architecture for Federating Atmospheric Data for GEOSS. IEEE Systems Journal, 2(3), 366-373.

INSPIRE geoportal. Available at: [Accessed June 26, 2009].

Panel on Information Technology and the Conduct of Research et al., 1989. Information Technology and the Conduct of Research: The User's View, National Academy Press.

Robinson, E.M. & Husar, R.B., 2008. DataSpaces: Using Community Workspaces to Enable Rich Air Quality Metadata. In American Geophysical Union, Fall Meeting 2008, abstract# IN22A-06.

Taylor, R.S. & Voigt, M.J., 1986. Value added processes in information systems, Greenwood Publishing Group Inc. Westport, CT, USA.

uFIND Pilot. Available at: [Accessed June 26, 2009].