Difference between revisions of "NASA ACCESS09: Tools and Methods for Finding and Accessing Air Quality Data"

From Earth Science Information Partners (ESIP)
 
 
{{NASA_ACCESS_AQIP_Backlinks}}<br>
 
<font size="4">'''uFIND: User-oriented Tool Set for Air Quality Data Discovery and Access'''</font>
  
== Tools and Methods for Finding and Accessing Air Quality Data ==
The objective of this work is to connect air quality users to air quality-relevant data services at NASA and elsewhere through the implementation of a set of well-tested and effective services, tools and methods. The centerpiece of the proposed project is the web service-based tool set: user-oriented Filtering and Identification of Networked Data (uFIND). The purpose of uFIND is to provide rich and powerful facilities for the user to: discover and choose a desired dataset by navigating the multi-dimensional metadata space using faceted search; seamlessly access and browse datasets; and use uFIND's facilities as a web service for mashups with other AQ applications and portals. Datasets found through uFIND will be accessible through the standard OGC WCS and WMS data access protocols. A major perceived contribution of the proposed work is an infrastructure for the harvesting, harmonization and continuous augmentation of metadata through the metadata flow system. The metadata will follow the ISO 19115 standard for geospatial metadata. The Service Oriented Architecture of uFIND will allow service-based interfacing with providers and users of the metadata. uFIND will be applicable to other Earth Science domains and be instrumental in the emerging Global Earth Observation System of Systems (GEOSS), in particular in the development of the GEOSS Common Infrastructure (GCI).
Short: Tools for Finding and Delivering Air Quality Data
 
 
This proposal is in response to the solicitation: [http://nspires.nasaprs.com/external/viewrepositorydocument/cmdocumentid=176847/A.34%20ACCESS%20corrected.pdf Advancing Collaborative Connections for Earth System Science (ACCESS) 2009]. In particular, it focuses on providing "means for '''users to discover and use services''' being made available by NASA, other Federal agencies, academia, the private sector, and others". This proposal is offering tools and methods for data access and discovery services for Air Quality-related datasets. However, the tools, methods and infrastructure should be applicable to other domains of science and application. 
 
  
There are major impediments to seamless and effective data usage encountered by both data providers and users. The impediments from the user's point of view are succinctly stated in the report by NAS (1989), in short: the '''user cannot find the data'''; if she can find them, she '''cannot access''' them; if she can access them, she does not '''know how good they are'''; if she finds the data good, she '''cannot merge''' them with other data. The data providers face a similar set of hurdles: the provider '''cannot find the users'''; if she can find users, she does not know how to seamlessly '''deliver the data'''; if she can deliver, she does not know how to '''make them more valuable''' to the users. This project intends to provide support to overcome the first two hurdles: finding the user/provider and accessing/delivering the desired data.
----
= <font size="4">Scientific/Technical/Management Section</font> =
  
== Objectives ==
  
This proposal is in response to the solicitation: Advancing Collaborative Connections for Earth System Science (ACCESS) 2009. In particular, it focuses on providing "means for users to discover and use services being made available by NASA, other Federal agencies, academia, the private sector, and others". The objective of this work is to implement a set of well-tested and effective services, tools and methods for the discovery and seamless access to air quality-related datasets.
  
== Background ==
  
===Approach ===
<div>Recent developments offer outstanding opportunities to fulfill the information needs for Earth Sciences and support for many societal benefit areas. The satellite sensing revolution of the 1990s now yields near-real-time observations of many atmospheric parameters. The data from surface-based monitoring networks now routinely provide detailed characterization of atmospheric and surface parameters. The "terabytes" of data from these surface and remote sensors can now be stored, processed and delivered in near-real time, and the instantaneous "horizontal" diffusion of information via the Internet now permits, in principle, the delivery of the right information to the right people at the right place and time. Nevertheless, atmospheric scientists and air quality decision makers face significant hurdles. The production of atmospheric observations and models is rapidly outpacing the rate at which these observations are assimilated and metabolized into actionable knowledge that can produce scientific understanding and societal benefits. The "data deluge" problem is especially acute for atmospheric scientists interested in the use of satellite observations. As a consequence, most atmospheric observations are significantly under-utilized. <br /></div><div>In the U.S., virtually all the important Earth Science datasets, including air quality-relevant observations, are now publicly accessible through the Internet. Typically, data providers place the numeric datasets onto a web server along with the associated 'Readme' files, which contain descriptive information about the data, access instructions and other metadata. Over the past decades, there was also a steady evolution of data directories where individual datasets could be registered, labeled and searched by potential users. The outstanding example of a long-term cataloging effort is NASA's Global Change Master Directory (GCMD), which illustrates the data deluge problem. 
A search of GCMD for "atmosphere" shows about 8000 entries for datasets and 1000 entries for services. Even an air quality parameter such as "aerosol" returns about 1000 datasets and 500 services. The directories of other agencies and nations probably contain at least this many entries. </div><div><br /> In GCMD, user support for finding a dataset is provided primarily through a controlled vocabulary of keywords along with free text search of the DIF-formatted standard metadata. Providers of Earth Observations can easily publish the data in GCMD and they only have to do it once. On the other hand, finding, accessing and using any given dataset takes a much larger effort, and it is repeated individually by every user of a given dataset. Collectively, the users of a dataset may expend 100 or 1000 times more effort than the publisher of that dataset. As pointed out by Cleveland (1982), in an information-rich environment most of the burden of accessing the proper information is carried by the user, not the producer. According to Taylor (1986), this type of information system is classified as a provider (content) driven information system, distinctly different from a user-driven information system. </div>
In the proposed data discovery system, the unit of data is a 'layer', i.e. a single measured parameter obtained by a given instrument. A layer can represent data from a surface monitor, a column measurement from a satellite (e.g. AOT) or a modeled parameter. Data layers originating from the same instrument are grouped into datasets (e.g. MODIS). For each data layer and dataset there is an ISO 19115 metadata record that describes its characteristics. The content of the metadata is then exposed to a harvesting (or query) service, which collects the metadata from distributed providers into the master catalog.
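
The layer, dataset and master catalog relationship can be sketched in Python. This is only an illustration under stated assumptions: the `Layer` class, its fields and the `harvest` function are hypothetical names, and the metadata is a plain dictionary standing in for a full ISO 19115 record.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """One measured or modeled parameter from one instrument (hypothetical model)."""
    name: str
    dataset: str  # dataset the layer belongs to, e.g. "MODIS"
    metadata: dict = field(default_factory=dict)  # stand-in for an ISO 19115 record


def harvest(providers):
    """Collect layer records from several providers into one master catalog.

    `providers` maps a provider name to the list of layers it exposes.
    The catalog is keyed by (dataset, layer name), so harvesting the same
    provider again overwrites its earlier records instead of duplicating them.
    """
    catalog = {}
    for provider, layers in providers.items():
        for layer in layers:
            record = dict(layer.metadata, provider=provider)
            catalog[(layer.dataset, layer.name)] = record
    return catalog


providers = {
    "provider_a": [Layer("AOT", "MODIS", {"platform": "Satellite"})],
    "provider_b": [Layer("PM2.5", "Airnow", {"platform": "Network"})],
}
catalog = harvest(providers)
print(sorted(catalog))
```

Keying the catalog by dataset and layer name is one possible design choice; it keeps periodic re-harvesting idempotent.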
 
  
The user accesses the master catalog, then navigates and explores it using an advanced faceted search service, data browsing and rich metadata resources.
The centerpiece of the proposed project is the web service-based tool set: '''u'''ser-oriented '''F'''iltering and '''I'''dentification of '''N'''etworked '''D'''ata (uFIND). The purpose of uFIND is to provide rich and powerful facilities for the user to discover and choose a desired dataset. These facilities include navigation through the multi-dimensional metadata space using faceted search (uFind Pilot, 2009), the ability to seamlessly access and browse datasets, and the use of uFIND's facilities as a web service for mashups with other AQ applications and portals.
  
Focus on '''impediments'''!!! Why cant we Publish, Find, Bind???
== Expected Significance ==
Invest energy into generic tools to aid Publish, Find, Bind. 
 
  
* All data are to be accessible through WMS WCS; data browser, exploration is part of the data discovery process
<div>The outcome of the proposed work will significantly reduce the burden on the user in finding and accessing data relevant to the understanding and management of air quality and atmospheric composition. Air quality is the scientific and research domain of the proposing team, and the proposed IT technologies and infrastructure will be directly applicable to furthering research in this domain. Air quality is also a significant application area of satellite observations. The tools, methods and infrastructure will also be applicable to other domains of Earth Sciences.<br /><br /> The AQ data user benefits because uFIND makes it easier to find suitable data and to access and use the data in creating either scientific or actionable knowledge. The providers of AQ data will also benefit, since their data products will be re-used more easily, enhancing the relevance and importance of their products. Finally, the earth science community at large will benefit from the tools and methods for improved data discovery and access, which are applicable to all earth science domains. The broad applicability of this proposed work will also be instrumental in the emerging Global Earth Observation System of Systems (GEOSS), in particular in the development of the GEOSS Common Infrastructure (GCI).<br /></div>
* Metadata are GEO-compatible ISO and unstructured

* Rich metadata collected from providers and users as well as from usage statistics.

* Metadata is the messaging connection and glue that connects data providers, users and mediators
 
 
  
Project can interact with community resources of ESIP WGs, GEO WGs, ....:
== uFIND Use Case ==
* Many thousands of diverse AQ data layers, tagged (classified). An ideal resource for a semantic use case. (Semantic Group)

* Since data will pass through many hands, an ideal use case for the design and testing of provenance (Greg)

* Connecting data providers with users through workspaces ... (Stefan)
 
* EPA, NOAA and other data are already 
 
  
<div>The utility of the AQ uFIND is illustrated below through a use case applicable to air quality management. On May 12, 2007 an intense wildland fire broke out at the Okefenokee National Park in southern Georgia. The intense smoke from the fire engulfed southern Georgia and drifted toward Florida. By May 24, the smoke pall was also drifting northward along the Mississippi Valley, impacting much of the eastern U.S. The smoke raised the ambient aerosol concentration well above the ambient air quality standard (35 µg/m<sup>3</sup>), causing exceedances. The air quality agencies in the smoke-impacted states were required to document that the exceedances were due to the Exceptional Event (EE) from the Okefenokee fires. In order to provide the required documentation, an air quality analyst of the impacted state would use uFIND (Fig. 1). </div>[[Image:AQDataFinder.png|400px]]<br><font size="1">Figure 1. Pilot Air Quality uFIND, http://webapps.datafed.net/geoss_catalog.aspx<br /></font><div>In the first step, the analyst would define the spatial and temporal domain of the area of smoke impact: Lat 15 to 45, Lon -90 to -65, 2007-05-27. Clicking 'Fire' under the search facet 'Domain' reveals the fire observations for that geographic area and time range. The selection shows that the dataset NOAA_HMS_WFS, from satellite platforms, contains multiple fire-related parameters arising from a variety of sensors (AVHRR, GOES, MODIS). After a few minutes of browsing each parameter through the Data Viewer integrated in uFIND, the analyst chooses the parameter HMS_All to represent the fire locations. Next, the analyst chooses the 'Aerosol' domain to explore the various aerosol datasets and parameters, which yields hundreds of parameter records from numerous datasets. In order to narrow the search, the analyst chooses 'Satellite' from the 'Platform' facet and 'Day' from the 'Time Resolution' facet, which still yields dozens of satellite-derived parameters. 
Since her interest is in the geographic location of smoke, she chooses the parameter AOD (Aerosol Optical Depth), which occurs in two datasets, MODIS4_AOT and Seaw_US. In order to ensure that the satellite aerosol signal is due to light-absorbing smoke, she also examines the parameter 'ABS_AER', which is a smoke indicator in the OMI_AI_G dataset. Each of the selected datasets is accumulated as a layer in the Data Browser.<br /><br /> As a next step, the analyst examines the magnitude of the smoke impact on the surface concentration of aerosols. Clicking 'Network' on the Platform facet, 'Hour' on the Time Resolution facet and 'Surface' on the Vertical facet returns all the hourly measured parameters at the surface, which can be found in three datasets: Airnow (EPA's real-time, surface monitoring network), AQS_H (EPA's additional hourly data) and Surf_Met (National Weather Service meteorological observations). For the smoke impact analysis, the analyst chooses PM2.5 from Airnow and AQS_H and RHBext from SurfMet. These surface observations are also added to the stack of data layers in the Data Viewer. Having the location of the fires identified by the satellite-derived fire pixels, the smoke dispersion pattern from the satellite AOD and confirmation from the absorbing aerosol index, she has sufficient documentation to satisfy the first requirement of the Exceptional Event Rule: evidence that the smoke event occurred and that the smoke was observed in impacted areas. </div><div></div><div>Subsequently, the examination of the surface observations reveals whether the satellite-observed smoke had a corresponding observed impact on the surface concentration. Through additional use of the available chemical composition data, she can also document that the excess aerosol mass concentration during the event can be attributed to smoke organic compounds. 
</div><div></div><div>The scenario described above is realistic; it was taken from an actual Exceptional Event case that was analyzed for EPA (Evidence for Flagging Exceptional Events, 2008). The original analyses required weeks of effort by experts to find, access and analyze the data. With the aid of the new uFIND System, the same activity can be performed in a matter of hours, and the resulting report can be more thorough because of the broader data pool available. We anticipate that similar time savings and overall productivity improvements can be achieved in other areas of air quality science, management and policy. A particularly suitable application area is international policy development regarding the Hemispheric Transport of Air Pollutants (HTAP), where satellite observations are crucial for documentation of intercontinental transport. </div>
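
The facet-by-facet narrowing in this use case can be sketched as a simple filter over catalog records. The sketch below is an assumption-laden illustration, not the uFIND implementation: the record layout, the function names `facet_filter` and `facet_counts`, and the facet values attached to each dataset are hypothetical.

```python
from collections import Counter

# Hypothetical catalog records; facet names follow the use case,
# the facet values assigned to each dataset are illustrative only.
RECORDS = [
    {"dataset": "NOAA_HMS_WFS", "parameter": "HMS_All", "Domain": "Fire",
     "Platform": "Satellite", "Time Resolution": "Day"},
    {"dataset": "MODIS4_AOT", "parameter": "AOD", "Domain": "Aerosol",
     "Platform": "Satellite", "Time Resolution": "Day"},
    {"dataset": "OMI_AI_G", "parameter": "ABS_AER", "Domain": "Aerosol",
     "Platform": "Satellite", "Time Resolution": "Day"},
    {"dataset": "Airnow", "parameter": "PM2.5", "Domain": "Aerosol",
     "Platform": "Network", "Time Resolution": "Hour"},
]


def facet_filter(records, selected):
    """Keep the records whose facet values match every selected facet."""
    return [r for r in records
            if all(r.get(facet) == value for facet, value in selected.items())]


def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(r[facet] for r in records if facet in r)


aerosol = facet_filter(RECORDS, {"Domain": "Aerosol"})
satellite_daily = facet_filter(
    aerosol, {"Platform": "Satellite", "Time Resolution": "Day"})
print([r["parameter"] for r in satellite_daily])
print(facet_counts(aerosol, "Platform"))
```

Showing the counts per facet value after each selection is what lets a user see how much each further click would narrow the result set.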
  
== Technical Approach and Methodology ==
  
====Architectural Approach:====
Information systems are practice driven. There are no natural laws to follow, the field is in pre-scientific age. Major input to the information system must come from an analysis of the information use environment, that a) Establishes the information flow into, within and out of an entity; b) Determines the criteria by which the value of information is judged. (Taylor, 1986).<br />
The architectural basis of the proposed work is Service Oriented Architecture (SOA) for the '''publishing''', '''finding''', '''binding to''' and '''delivery''' of data as services. The critical aspect of SOA is the loose coupling between service providers and service users. Loose coupling is accomplished through plug-and-play connectivity facilitated by standards-based data access service protocols. SOA is the only architecture that we are aware of that allows seamless connectivity of data between a rich set of provider resources and a diverse array of users. Service providers register services in a suitable service registry; the users discover the desired service and access the data.
 
The result is a dynamic binding mechanism for the construction of loosely-coupled work-flow applications.  
 
  
Service orientation has been accepted as the desired way of delivering Earth Observation (EO) data products. However, the adoption of formal standards-based data-as-service offerings for the delivery of data products has been slow in NASA and other Agencies. While offering images through the OGC WMS standard interface is becoming common for many Federal Agency data products, there is currently no effective way for users to find those services, which are advertised and exposed across dispersed web pages.
<div><br /> There are major impediments to seamless and effective data usage encountered by both data providers and users. The impediments from the user's point of view are succinctly stated in the report by Research, P.O.I.T.A.T.C.O (1998), in short: the user cannot find the data; if she can find them, she cannot access them; if she can access them, she does not know how good they are; if she finds the data good, she cannot merge them with other data. The data providers face a similar set of hurdles: the provider cannot find the users; if she can find users, she does not know how to seamlessly deliver the data; if she can deliver, she does not know how to make them more valuable to the users. This project intends to provide support to overcome the first two hurdles: '''finding''' and '''accessing''' the air quality data.<br /></div>
  
=== Architecture and Technologies Used ===
  
<div>The centerpiece of the proposed project is the web service-based tool set: '''u'''ser-oriented '''F'''iltering and '''I'''dentification of '''N'''etworked '''D'''ata (uFIND). The purpose of uFIND is to provide rich and powerful facilities for the user to discover and choose a desired dataset. These facilities include navigation through the multi-dimensional metadata space using faceted search (uFind Pilot, 2009), the ability to seamlessly access and browse datasets, and the use of uFIND's facilities as a web service for mashups with other AQ applications and portals (Fig. 2).</div>
[[Image:SOA_Metadata.png|400px]]<br>
<div><font size="1" color="#000000">Figure 2. Service Oriented Architecture and ISO 19115 Metadata Schematic</font><br /></div><div></div><div>The critical aspect of SOA is the loose coupling between service providers and service users. Loose coupling is accomplished through plug-and-play connectivity facilitated by standards-based data access service protocols. SOA is the only architecture that we are aware of that allows both loose, dynamic connection and seamless flow of data between a rich set of provider resources and a diverse array of users. In SOA, a service provider registers services in a suitable service registry; the users discover the desired service, retrieve an access key and seamlessly access the data. The result is a dynamic binding mechanism for the construction of loosely-coupled work-flow applications. </div><div><br /> Service orientation has been accepted as the desired way of delivering Earth Observation data products. However, the adoption of formal standards-based data-as-service offerings has been slow within NASA and other Agencies. Offering images through the OGC WMS standard interface is becoming common for many Federal Agency data products, but there is currently no effective way for users to find those services, since they are dispersed over many web pages. </div><div><br /> This project will rely on mature and widely used standard protocols for interoperability:<br /> (1) OGC WMS, WCS and WFS web services for accessing data;<br /> (2) netCDF-CF as the standard data format for gridded and point monitoring data and models;<br /> (3) ISO 19115 for geospatial metadata;<br /> (4) RSS/Atom and HTTP for inter-service message transfer (Fig. 4).<br /><br /> Furthermore, the uFIND software system will be implemented using three key maturing 'intellectual technologies': (1) Tagging for flexible, user-extensible structuring and annotation of diverse metadata; (2) Faceted search technology for navigation through multidimensional data discovery; and (3) an Ajax-based dynamic user interface for data discovery and spatio-temporal data browsing and exploration.<br /><br /> The individual technologies described above can be considered to be at a technology readiness level (TRL) of at least 7. They have been developed and implemented in near-operational environments. Many are used routinely in web data and information exchange among scientific and public domain communities. The work described in this proposal is targeted at combining those technologies and techniques in order to provide a user-driven uFIND.<br /></div>
  
====Engineering and Implementation:====
=== Engineering and Implementation ===
The three major components of this SOA-based project are: (1) facilities for publication of data services; (2) facilities to find data services; and (3) facilities to access data services. Everything is a service, which allows mashups with other systems and connectivity through workflows. Each of these components will be supported by a set of tools and methods.
 
  
[[Image:MetadataFlow.png|500px]]
<div>The three major components of the proposed SOA-based uFIND software are: (1) facilities to identify AQ-relevant data services; (2) facilities for publication of data as a service; and (3) facilities for AQ users to find and access data services. In this design, each component is a service, which allows the reuse of any part of the system in mashups with other web-based applications. Each of these major components will be supported by a set of tools and methods.<br /></div>
Designed for lateral connectivity and expansion.
 
 
  
====Technology Approach:====  
==== User-Oriented Filtering of Relevant Datasets ====
The project will rely on three key mature and widely used standard protocols for interoperability: (1) OGC '''WMS, WCS''' and '''WFS''' web services for accessing data; (2) ISO 19115 for geographic metadata; and (3) RSS/Atom and HTTP for inter-service data transfer. Furthermore, the developing software is implemented using three key maturing 'intellectual technologies': (1) '''Tagging''' for flexible, user-extensible structuring and annotation of diverse metadata; (2) '''Faceted search''' technology for navigation through multidimensional data discovery; and (3) an '''Ajax'''-based dynamic user interface for data discovery and exploration.
 
  
[[Image:MetadataTools.png|500px]]
<div>The AQ user needs will determine what datasets are initially targeted for inclusion in uFIND. The identified AQ Earth Observation priorities from the GEO Task US-09-01A (GEO Task US-09-01a Home Page, 2009), as well as other published, user-filtered dataset lists like the one prepared by Scheffe for EPA, will guide the human and machine surveying of catalogs like GCMD, the GEOSS Clearinghouse, GeoSpatial OneStop, etc. in order to identify potential datasets. The identification of AQ-relevant datasets will initially be performed by the proposed team, but it will expand to become a communal process. Initially, the proposing team, with the ESIP AQ Workgroup, will work with AQ data providers and AQ users in order to populate uFIND with relevant AQ data. Once a critical mass of participation is reached within the uFIND network, the larger community of participants can act as a filter for what is valuable to the AQ Community.<br /><br /> Another batch of selected datasets will come from several "data hubs" that already have a subset of datasets relevant to particular AQ analysts. Some examples are EPA AQS for State Analysts, VIEWS for RPOs, NILU EMAP and NASA Giovanni, as well as DataFed, developed by the proposing team. A full listing of current air quality data systems was compiled during EPA's Data Summit in 2008 (Summit, 2008), which also contains many user recommendations. In uFIND these datasets will be harmonized, so that the data being offered through any of these hubs can be reused by multiple applications. </div><div><br /> It is hoped that uFIND will also attract potential AQ data providers from the AQ community to participate, through the incentive of reaching more data users as well as the immediate reward that a dataset accessed through standard formats can be used at once with many viewers and other client applications. 
Participation in uFIND will also allow providers to review web analytic information and connect with air quality users of their data, receiving both direct and indirect feedback about their data products. </div>
  
====Management Approach:====
==== Data Access ====
The major Core group
 
Community - Find-bind ... was developed and tested during AIP2 by the community
 
Collaborative,
 
  
Semantic people
<div>The first condition of uFIND is providing data through a standard interface. Datasets found through AQ uFIND will be accessible through the standard OGC WCS and WMS data access protocols. This allows users to seamlessly access the data subset of interest. With standard interfaces in place, datasets can be converted into other user formats such as netCDF and KML and made accessible to data browsers for exploration. </div><div></div>
Provenance
 
Workflow
 
  
=== Data as Service ===
==== Data as Service ====
* Wrappers, reusable tools and methods (wrapper classes for: SeqImage/Seq File, SQL, netCDF) into WCS and WMS 
 
** Unidata (netCDF), CF (CF working group; Naming); GALEON (WCS-netCDF) building on top of...
 
  
=== Metadata as Service ===
<div></div><div>OGC WCS is particularly applicable for representing space-time-varying phenomena in Fluid Earth Sciences, atmosphere and oceans. OGC WCS version 1.1 is limited to grids, or "simple" coverages, with homogeneous range sets, but future revisions of the standard are anticipated to include support for a broader set of coverages, including point coverages.<br /></div>
* Metadata not service: readme file
 
* Metadata as service: Capability
 
  
[[Image:WCS_DataAccess.png|400px]]<br />
  
*
<font size="1">Figure 3. a. WCS Communication Protocol b.WCS Data Wrappers </font><br />
WCS/WMS GetCapabilities Conventions that allow metadata reuse
 
* WCS Capabilities expanded -> WMS (combination of WCS and Render) (build metadata in 2 steps: WCS first, then augment with WMS fields)
 
* WMS GetCapabilities > ISO Maker Tool publish metadata
 
  
=== Publishing (metadata -> WAF) ===
<br />
  
The data discovery through the clearinghouse is aided by ISO metadata for Geospatial data which is prepared for each dataset.
Two attractive features of these services are that (1) they can be executed using the simple, universal HTTP GET/POST Internet protocol; and (2) the services are described by formal XML documents ("GetCapabilities", "DescribeCoverage"), and the access instructions and output formats can be advertised in those service documents (Fig. 3a).
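
Because the services are plain HTTP GET requests, a request URL can be assembled with nothing more than the standard library. The sketch below is illustrative only: the endpoint `http://example.org/wcs` and the coverage name `AOD` are hypothetical, and the parameter names follow our understanding of the WCS 1.1 key-value-pair encoding, which should be checked against the target server's capabilities document.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real server advertises its own URL.
BASE_URL = "http://example.org/wcs"


def wcs_url(request, **params):
    """Build a key-value-pair (KVP) WCS request URL for an HTTP GET call."""
    query = {"service": "WCS", "version": "1.1.0", "request": request}
    query.update(params)
    return BASE_URL + "?" + urlencode(query)


# Service-level description: what coverages exist and how to request them.
capabilities_url = wcs_url("GetCapabilities")
# Coverage-level description; "identifiers" is assumed from WCS 1.1 KVP usage.
describe_url = wcs_url("DescribeCoverage", identifiers="AOD")
print(capabilities_url)
print(describe_url)
```

The returned XML documents would then be fetched with any HTTP client and parsed to learn the supported output formats before a data request is issued.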
  
The primary purpose of the metadata is to facilitate finding and accessing the data, addressing the first two hurdles that users face. Clearly, air quality-specific metadata such as sampling platform, data domain and measured parameters need to be defined by air quality users. Dealing with the hurdles of data quality and multi-sensor data integration is a topic for future efforts. <br>
===== Wrappers and Tools =====
  
The metadata is prepared by transforming and augmenting OGC GetCapabilities documents into ISO metadata records. The GetCapabilities document provides the initial metadata for an ISO 19115 metadata record; using the tools and methods provided, this metadata is extended to include AQ-specific metadata. The ISO record is validated and saved into the AQ community catalog. The community catalog is registered as a component in the GEOSS Component and Service Registry (CSR). The GEOSS Clearinghouses query the GEOSS CSR for catalogs and then periodically harvest the catalogs for their metadata records, completing the metadata publishing process.
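
The transform, augment and validate steps can be sketched as follows. This is a simplified illustration under several assumptions: the XML is a hand-written fragment in the style of a WMS 1.1.1 capabilities document, the output is a flat dictionary rather than a real ISO 19115 XML record, and the function names and the list of required fields are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed fragment in the style of a WMS 1.1.1 capabilities
# document; a real document carries many more elements.
CAPABILITIES = """
<WMT_MS_Capabilities version="1.1.1">
  <Service>
    <Title>Example AQ data service</Title>
  </Service>
  <Capability>
    <Layer>
      <Name>AOD</Name>
      <Title>Aerosol Optical Depth</Title>
      <LatLonBoundingBox minx="-90" miny="15" maxx="-65" maxy="45"/>
    </Layer>
  </Capability>
</WMT_MS_Capabilities>
"""


def capabilities_to_records(xml_text, extra):
    """Turn each named layer into a flat record, then add AQ-specific fields."""
    root = ET.fromstring(xml_text)
    service_title = root.findtext("Service/Title")
    records = []
    for layer in root.iter("Layer"):
        name = layer.findtext("Name")
        if name is None:  # grouping layers carry no Name
            continue
        box = layer.find("LatLonBoundingBox")
        bbox = None
        if box is not None:
            bbox = [float(box.get(key)) for key in ("minx", "miny", "maxx", "maxy")]
        record = {"service": service_title, "layer": name,
                  "title": layer.findtext("Title"), "bbox": bbox}
        record.update(extra.get(name, {}))  # augmentation with AQ metadata
        records.append(record)
    return records


def validate(record, required=("layer", "title", "bbox", "Domain", "Platform")):
    """Return the required fields that are missing or empty."""
    return [name for name in required if not record.get(name)]


records = capabilities_to_records(
    CAPABILITIES, {"AOD": {"Domain": "Aerosol", "Platform": "Satellite"}})
print(records[0])
print(validate(records[0]))
```

Keeping the augmentation in a separate mapping means the provider's capabilities document is never edited; the AQ-specific fields are layered on top of it.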
Most air quality datasets reside on servers in individual files or in SQL servers. These datasets need to go through a 'wrapping process' (Fig. 3b). Wrappers are reusable interfaces that turn data in files into data as service. Wrapper classes exist for: Sequential Images and Files, SQL server databases and for netCDF data files.<br /><br />
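
The idea of a reusable wrapper can be sketched as one common interface with a storage-specific class behind it. Everything named here is hypothetical: the `DataWrapper` interface, the `obs` table layout and the comma-separated file layout are assumptions made for illustration, not the project's actual wrapper classes.

```python
import sqlite3
from abc import ABC, abstractmethod


class DataWrapper(ABC):
    """Common interface that every storage-specific wrapper exposes (hypothetical)."""

    @abstractmethod
    def get_values(self, parameter, start, end):
        """Return (time, value) pairs for one parameter within [start, end]."""


class SqlWrapper(DataWrapper):
    """Wrapper over a SQL table obs(parameter, time, value)."""

    def __init__(self, connection):
        self.connection = connection

    def get_values(self, parameter, start, end):
        rows = self.connection.execute(
            "SELECT time, value FROM obs "
            "WHERE parameter = ? AND time BETWEEN ? AND ? ORDER BY time",
            (parameter, start, end))
        return rows.fetchall()


class FileWrapper(DataWrapper):
    """Wrapper over text lines of the form: parameter,time,value."""

    def __init__(self, lines):
        self.lines = lines

    def get_values(self, parameter, start, end):
        result = []
        for line in self.lines:
            name, time, value = line.strip().split(",")
            # ISO dates compare correctly as plain strings.
            if name == parameter and start <= time <= end:
                result.append((time, float(value)))
        return sorted(result)


rows = [("PM2.5", "2007-05-24", 41.0), ("PM2.5", "2007-05-27", 58.5),
        ("PM2.5", "2007-06-01", 12.0)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (parameter TEXT, time TEXT, value REAL)")
conn.executemany("INSERT INTO obs VALUES (?, ?, ?)", rows)
lines = ["%s,%s,%s" % row for row in rows]

for wrapper in (SqlWrapper(conn), FileWrapper(lines)):
    print(wrapper.get_values("PM2.5", "2007-05-20", "2007-05-31"))
```

A service layer written against `DataWrapper` would not need to know which storage sits underneath, which is the sense in which a wrapper turns data in files or tables into data as a service.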
  
[[Image:20090505 AIP2 Metadata Registration.png|400px]]
<div>A particularly important wrapper class is for data files formatted according to the netCDF-CF convention. Special effort will be invested to create a portable software template for accessing netCDF-formatted data using the WCS protocol. It is hoped that this tool will promote the access and ease of use of distributed data. <br /></div>
  
=== Finding (...) ===
<br />
The finding of air quality data is accomplished in two stages: a coarse filter generic to all Earth Observations through the clearinghouse and then a high resolution filter specific to air quality in the AQ Catalog Browser.
 
  
The GEOSS clearinghouse provides a coarse filter using generic discovery metadata to find Earth Observations. In addition, the clearinghouse exposes a search API which enables machine queries to be made. The returned records can be further searched using the entire metadata record originally submitted, allowing a more refined search specific to a particular community. The AQ Community has built a catalog browser interface to the clearinghouse which enables this two-step process. After the initial coarse filter, the returned records are browsed using a customized faceted search interface, built to search the extended AQ metadata and find AQ data using specific filters such as sampling platform and data structure. Finding the right data is further enhanced by the user's ability to immediately view data as WMS through multiple clients provided by ESRI, Compusult and others. This proposal allows query results to be embedded in another web page. The proposal will also link to multiple available WMS viewers to browse the layers available in the catalog.
The netCDF-CF file format is a common way of storing and transferring gridded meteorological and air quality model results. The CF convention for structuring and naming of netCDF-formated data further enhances the semantics of the netCDF files. The netCDF CF data format is most useful for the exchange of multidimensional gridded model data. It was also demonstrated that the netCDF format is well suited for the encoding and transfer of station monitoring data. Traditionally, satellite data were encoded and transferred using the HDF format. The new netCDF version 4 (beta) library provides a common API for netCDF and HDF-5 data formats. The netCDF-CF data format is supported by a robust set of well-documented and maintained low-level libraries for creating, maintaining and accessing data in that format for multiple platforms (Linux, Windows). The low level libraries provided by UNIDATA also offer a clear application programing interface (API).<br /><br /> The WCS wrapper for netCDF software has the triple functionality: (1) Accessing netCDF-CF files contents over the HTTP Get Internet protocol; (2) Imposing a standard data query language using the WCS standard; (3) Allow easy (non-intrusive) adaptation to evolving standards. The main components of the wrapper software are shown schematically in the Figure left. At the lowest level are open source libraries for accessing netCDF and XML files. At the next level are Python scripts for extracting spatially subset slices for specific parameters and times. At the third level, is the WCS interpreter that parses the WCS url. The Capabilities and Description files are created automatically from the NetCDF files, but you can provide a template containing information about your organization, contacts and other metadata. The WCS - netCDF-CF wrapper is a communal activity initiated by our group and pursued by collaborators from the US, New Zealand, Germany and elsewhere. 
It is also a participatory 'project' within the ESIP Air Quality Workgroup. It is anticipated that the wrapper will find wide application in this project.
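
The HTTP GET access pattern described above can be illustrated with a minimal sketch. It only assembles a request URL and is not the wrapper's own code. The parameter names follow the WCS 1.0.0 key-value encoding; the server address, the coverage name <code>pm25</code>, the format name and the helper <code>build_getcoverage_url</code> are assumptions made for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_getcoverage_url(base_url, coverage, bbox, time, width, height,
                          fmt="NetCDF", crs="EPSG:4326"):
    """Return a WCS 1.0.0 GetCoverage request encoded as an HTTP GET URL.

    base_url, coverage and fmt are placeholders; a real server advertises
    the values it accepts in its Capabilities and Description documents.
    """
    params = {
        "SERVICE": "WCS",
        "VERSION": "1.0.0",
        "REQUEST": "GetCoverage",
        "COVERAGE": coverage,
        "CRS": crs,
        # Spatial subset as minx,miny,maxx,maxy in the requested CRS.
        "BBOX": ",".join(str(v) for v in bbox),
        # Temporal subset as an ISO 8601 date or datetime string.
        "TIME": time,
        # Size of the returned grid in cells.
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": fmt,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical request: a continental US slice for one day.
url = build_getcoverage_url("http://example.org/wcs", "pm25",
                            (-125, 24, -66, 50), "2009-06-20", 118, 52)
query = parse_qs(urlparse(url).query)
print(query["COVERAGE"][0], query["BBOX"][0])
```

Because the whole query is carried in the URL, any client that can issue an HTTP GET can request a spatial and temporal subset without knowing how the netCDF files are stored.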
  
Additionally, the finding process will be augmented through web-based analytics that monitor user activity of the AQ catalog. These analytics will highlight where users come from spatially and virtually (i.e. by search engine, link...) as well as what datasets, spatial and temporal domain and parameters are most viewed or what datasets were viewed together. This feedback will help to hone the catalog to provide more useful information to users such as more datasets in a certain domain or tips for what others who viewed this dataset also viewed. The metrics will also help providers by identifying who some of the users of the data are.
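
As a rough illustration of the "viewed together" metric mentioned above, the following sketch counts how often pairs of datasets appear in the same user session. The session lists and dataset names are invented for the example, and the function names are hypothetical; a real implementation would read this information from the analytics service.

```python
from collections import Counter
from itertools import combinations


def co_viewed_pairs(sessions):
    """Count how many sessions contain each unordered pair of datasets."""
    counts = Counter()
    for viewed in sessions:
        # Deduplicate and sort so (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(viewed)), 2):
            counts[pair] += 1
    return counts


def also_viewed(dataset, sessions, top=3):
    """Return the datasets most often viewed in the same session as `dataset`."""
    related = Counter()
    for (a, b), n in co_viewed_pairs(sessions).items():
        if a == dataset:
            related[b] += n
        elif b == dataset:
            related[a] += n
    return [name for name, _ in related.most_common(top)]


# Invented sessions: each list holds the datasets one visitor looked at.
SESSIONS = [
    ["AIRNOW_PM25", "MODIS_AOD"],
    ["AIRNOW_PM25", "MODIS_AOD", "OMI_NO2"],
    ["OMI_NO2", "MODIS_AOD"],
]
print(also_viewed("AIRNOW_PM25", SESSIONS))
```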
+
==== AQ uFIND Metadata ====
  
[[Image:20090505 AIP2 ADC UICSlide8.PNG|400px]]<br>
+
<div>The primary purpose of the metadata is to facilitate finding and accessing the data, in order to help users deal with the first two hurdles they face. In current catalogs, metadata is supplied by the provider or distributor of the data. This metadata includes intrinsic discovery metadata such as spatial and temporal extent, keywords and contact information for the provider. The metadata also includes distribution information for data access. Additionally, providers include various other information to help users once they are at the provider's site. This approach is provider-driven and does not incorporate user needs.<br /><br /> In a user-centric information system the user experience is improved by metadata contributed along the entire line of data usage, from the providers to the users. Furthermore, the datasets need additional AQ-relevant metadata added to the metadata that the provider gives, so that the AQ user can easily find the data. This additional metadata allows sharp queries in parameter space, time, and physical space. Another feature of the user-centric system is that, using web analytics, additional metadata is attached to each dataset in order to provide information about dataset usage characteristics.<br /><br /> The uFIND system incorporates the structured metadata along the data usage chain using ISO 19115, the metadata standard for geospatial data. ISO 19115 is ideal because its structure accounts not only for traditional data access and discovery metadata, but also for usage, lineage and other metadata needed for understanding the data. 
The uFIND record also includes pointers to additional metadata resources, such as a pointer back to the original metadata record from the data provider and a pointer to the associated DataSpace where the metadata record can be viewed. ISO 19115 is also registered in the GEOSS Standards Registry, which allows uFIND to be harvested by the GEOSS Clearinghouse and found through other portals.<br /></div>
  
===Binding ===
+
===== Metadata Flow Architecture =====
  
Once the data are accessible through standard service protocols and discoverable through the clearinghouse they can be incorporated and browsed in any client application including the ESRI and Compusult GEO Portals. <br>
+
The major perceived contribution of the proposed work is an infrastructure for the harvesting, harmonization and continuous augmentation of metadata through the metadata flow system, tailored for use in air quality applications. A design of the metadata system is shown schematically in Figure 4.  
  
The registered datasets are also directly accessible to air quality specific, work-flow based clients which can perform value-adding data processing and analysis.
+
[[Image:MetadataFlowSchematic.png|400px]]
 +
<br /><font size="1">Figure 4. Schematic of the uFIND metadata flow and components</font><br /><br /> The description below follows the component schematic from left to right. Note that the schematic above represents an actual functioning prototype that has been implemented for the GEOSS Architecture Implementation Pilot-II (GEO, 2009). However, it is anticipated that both the individual components and the overall functionality of the metadata flow infrastructure will continue to evolve during this project.<br />
  
The loose coupling between the growing data pool in GEOSS and workflow-based air quality client software shows the benefits of the Service Oriented Architecture to the Air Quality and Health Societal Benefit Area.  <br>
+
===== Provider-Contributed Metadata =====
 
Add new browser and client
 
  
==Background==
+
The components to the left indicate the providers that are anticipated to supply metadata. Each supplier is connected to uFIND through two processes. The first step is to filter the available metadata resources and select the records that are relevant to air quality. The second step is to transform the diverse source metadata into the uniform AQ metadata records adopted by uFIND (ISO 19115). It should be recalled that a precondition for registering an air quality-relevant dataset in uFIND is that the data are accessible through an OGC standard data access protocol.<br />
  
Recent developments offer outstanding opportunities to fulfill the information needs of the Earth Sciences and to support many societal benefit areas. The satellite sensing revolution of the 1990s now yields near-real-time observations of many Earth System parameters. The data from surface-based monitoring networks now routinely provide detailed characterization of atmospheric and surface parameters. The ‘terabytes’ of data from these surface and remote sensors can now be stored, processed and delivered in near-real time, and the instantaneous ‘horizontal’ diffusion of information via the Internet now permits, in principle, the delivery of the right information to the right people at the right place and time. Standardized computer-computer communication languages and the emerging Service-Oriented information systems now facilitate the flexible processing of raw data into high-grade scientific or ‘actionable’ knowledge. Last but not least, the World Wide Web has opened the way to generous sharing of data and tools, leading to faster knowledge creation through collaborative analysis in real and virtual workgroups.
+
<div><br /></div>
  
Nevertheless, Earth scientists and societal decision makers face significant hurdles. The production of Earth observations and models is rapidly outpacing the rate at which these observations are assimilated and metabolized into actionable knowledge that can produce societal benefits. The “data deluge” problem is especially acute for analysts interested in climate change and atmospheric processes: the processes are inherently complex, the numerous relevant data range from detailed surface-based chemical measurements to extensive satellite remote sensing, and the integration of these requires the use of sophisticated models. As a consequence, Earth Observations (EO) are under-utilized in science and for making societal decisions.
+
===== Metadata Tools =====
  
=== Web Services===
+
This proposed work will provide tools developed in collaboration with EU INSPIRE (INSPIRE, 2009) and others to create ISO 19115 metadata records either through the transformation of one metadata standard to ISO or by expanding the OGC GetCapabilities document. The tools simplify the process of generating metadata by a semi-automatic process. Metadata in various formats such as, FGDC, DIF, etc have crosswalks to ISO 19115. Using the published crosswalks, the metadata will be re-mapped into ISO 19115 through a style sheet transformation.<br /><br /> The proposing team has developed a metadata tool that maps corresponding GetCapabilities metadata elements to ISO 19115 metadata elements and provides a user interface to manually complete the remaining ISO 19115 metadata elements for which the GetCapabilities do not contain information. The provider is also able to modify the metadata elements that were extracted from the GetCapabilities and then save the record as an ISO 19115 compliant file. Each metadata record will incorporate additional AQ-specific metadata needed for the AQ uFIND as well as content acquired during the downstream usage. Prior to acceptance in uFIND, each ISO metadata record is also validated through a web service ISO 19115 validator.<br />
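
The mapping step described above can be sketched as follows. This is an illustration under stated assumptions rather than the team's actual tool: the sample document imitates an un-namespaced WMS 1.1.1 Capabilities response, the record is a flat dictionary keyed by ISO 19115 element names instead of a full ISO 19115 XML record, and the list of required elements is a hypothetical subset chosen for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample in the style of a WMS 1.1.1 Capabilities document.
SAMPLE_CAPABILITIES = """<WMT_MS_Capabilities version="1.1.1">
  <Service>
    <Title>Example AQ map service</Title>
    <Abstract>Hypothetical surface PM2.5 layers.</Abstract>
    <KeywordList>
      <Keyword>air quality</Keyword>
      <Keyword>PM2.5</Keyword>
    </KeywordList>
  </Service>
  <Capability>
    <Layer>
      <Name>pm25</Name>
      <LatLonBoundingBox minx="-125" miny="24" maxx="-66" maxy="50"/>
    </Layer>
  </Capability>
</WMT_MS_Capabilities>"""

# Hypothetical subset of elements a catalog might require before acceptance.
REQUIRED = ("title", "abstract", "descriptiveKeywords",
            "geographicBoundingBox", "pointOfContact", "lineage")


def capabilities_to_iso(xml_text):
    """Copy the elements a Capabilities document can supply into a draft record."""
    root = ET.fromstring(xml_text)
    record = {
        "title": root.findtext("Service/Title"),
        "abstract": root.findtext("Service/Abstract"),
        "descriptiveKeywords": [
            k.text for k in root.findall("Service/KeywordList/Keyword")
        ],
    }
    box = root.find("Capability/Layer/LatLonBoundingBox")
    if box is not None:
        record["geographicBoundingBox"] = {
            "westBoundLongitude": float(box.get("minx")),
            "southBoundLatitude": float(box.get("miny")),
            "eastBoundLongitude": float(box.get("maxx")),
            "northBoundLatitude": float(box.get("maxy")),
        }
    return record


def missing_elements(record):
    """List required elements that still need to be completed by hand."""
    return [name for name in REQUIRED if not record.get(name)]


draft = capabilities_to_iso(SAMPLE_CAPABILITIES)
print(missing_elements(draft))
```

The elements reported as missing correspond to the fields that a provider would complete manually through the user interface before the record is validated.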
A Web Service is a URL-addressable resource that returns requested data, e.g. the current weather or the map for a neighborhood. Web Services use standard web protocols (HTTP, XML, SOAP, WSDL) that allow computer-to-computer communication, regardless of language or platform. Web Services are reusable components, like ‘LEGO blocks’, that allow agile development of richer applications with less effort. Visionaries (e.g. Berners-Lee, the ‘father’ of the World Wide Web) argue that Web services can transform the web from a medium for viewing and downloading into one for distributed data/knowledge exchange and computing.
 
  
Enabling Protocols of the Web Services architecture: '''Connect.''' Extensible Markup Language (XML) is the universal data format that makes data and metadata sharing possible. '''Communicate.''' Simple Object Access Protocol (SOAP) is the new W3C protocol for data communication, e.g. making and responding to requests. '''Describe.''' Web Service Description Language (WSDL) describes the functions, parameters and the returned results from a service. '''Discover.''' Universal Description, Discovery and Integration (UDDI) is a broad industry effort, standardized through OASIS, for locating and understanding web services.
+
==== Data Finding ====
  
'''Service Oriented Architecture (SOA)''' provides methods for systems development and integration in which systems package functionality as ''interoperable'' ''services''. SOA allows different applications to exchange data with one another. Service-orientation aims at a ''loose coupling'' of services with the operating systems, programming languages and other technologies that underlie applications. These services communicate with each other by passing data from one service to another, or by coordinating an activity between two or more services. SOA can be seen as part of a continuum, from older concepts of distributed computing and modular programming through to current practices of mashups and Cloud Computing.
+
<div>Records are browsed using a customized, faceted search interface that was built to search the extended AQ metadata record set and find AQ data using specific filters such as sampling platform and data structure. Finding the right data is further enhanced by the user's ability to immediately view data as WMS through multiple clients provided by ESRI, Compusult and others. uFIND also allows query results to be embedded in another web page and links to multiple WMS viewers to browse layers in the catalog.<br /><br /> In uFIND, the AQ user can search for datasets through a faceted search which reacts to each step of the user's query. This dynamic interface ensures that the user can navigate by the more familiar facets first, and make decisions about less familiar facets as the search is narrowed. The results returned will carry additional metadata to aid decision-making, such as the number of users that have used the dataset or links to places where it is used, so that the AQ user is provided with some additional context.<br /><br /> Given that each AQ dataset is equipped with the standard data access interface, uFIND includes a Data Viewer. The user can browse data layers, compare them and further explore the data, ultimately choosing the most appropriate data for their application. </div><div><br /><font color="#000000"><span><font color="rgb(0, 0, 0)">The AQ uFIND will be equipped with Google Analytics, which tracks multiple dimensions of data query usage as metrics. The analytics information will be exposed to the users as well, and information derived from the metrics, such as the most popular queries, will help users decide how to start their queries. The monitoring of query popularity is possible because each query has a unique URL with a recorded number of page views.</font></span></font><font color="#000000"><font color="rgb(0, 0, 0)"> Google Analytics will also identify how most users access a dataset, e.g. through Google Earth, WMS, WCS, netCDF, etc. 
</font></font><font color="#000000"><font color="rgb(0, 0, 0)">This feedback will help to hone the catalog to provide more useful information to users, such as more datasets in a certain domain or tips on what others who viewed this dataset also viewed. The metrics will also help providers by identifying who some of the users of the data are and how they are finding a data product.</font></font><span><font color="rgb(0, 0, 0)"> </font></span></div>
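
The faceted search behavior described above, in which the available choices react to each step of the query, can be sketched in a few lines. The records, facet names and values below are invented for illustration and do not reflect the actual uFIND metadata schema.

```python
from collections import Counter

# Invented catalog records with three hypothetical facets.
RECORDS = [
    {"id": "a", "platform": "satellite", "structure": "grid", "parameter": "AOD"},
    {"id": "b", "platform": "surface", "structure": "point", "parameter": "PM2.5"},
    {"id": "c", "platform": "satellite", "structure": "grid", "parameter": "NO2"},
    {"id": "d", "platform": "model", "structure": "grid", "parameter": "PM2.5"},
]


def apply_facets(records, selected):
    """Keep the records that match every facet value selected so far."""
    return [r for r in records
            if all(r.get(facet) == value for facet, value in selected.items())]


def facet_counts(records, facets):
    """Count the remaining values of each facet, for display next to the results."""
    return {facet: Counter(r[facet] for r in records if facet in r)
            for facet in facets}


# Step 1: the user picks a familiar facet first.
grids = apply_facets(RECORDS, {"structure": "grid"})
# Step 2: the interface shows which platform values remain and how many records each has.
remaining = facet_counts(grids, ["platform", "parameter"])
print([r["id"] for r in grids], dict(remaining["platform"]))
```

Recomputing the counts after every selection is what lets the interface hide facet values that would lead to an empty result.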
  
 +
==== DataSpaces ====
  
[[Image:20090505 AIP2 ADC UICSlide3.PNG|400px]]<br>
+
<div>
There are numerous Earth Observations that are available and in principle useful for air quality applications such as informing the public and enforcing AQ standards. However, connecting a user to the right observations or models is accompanied by an array of hurdles.
 
  
The GEOSS Common Infrastructure allows the reuse of observations and models for multiple purposes 
+
Currently, metadata for air quality datasets is variable, distributed and normally created by the provider for the user. However, a single dataset can be used for many applications that the provider may or may not anticipate and the data may go through many value-adding processes before it reaches the "end user". Additional metadata can be created at any step along the usage chain and at this time there is no mechanism for collecting this metadata. Consequently, users don't know how a dataset has been used or what additional processing has occurred beyond the originator. One method to harvest and share metadata from all members of the usage chain is through community workspaces, DataSpaces. DataSpaces are virtual spaces for contributing and archiving metadata, discussing the dataset and harvesting distributed resources in order to capture the critical community knowledge about the dataset.
  
Even in the narrow application of Wildfire smoke, observations and models can be reused.
+
<br />
<br>
 
[[Image:20090505 AIP2 ADC UICSlide4.PNG|400px]]<br>
 
  
[[Image:20090505 AIP2 ADC UICSlide5.PNG|400px]]<br>
+
A DataSpace (Robinson, 2008) for a given dataset has two parts: structured, semantically rich metadata and flexible community-contributed metadata. The structured dataset description includes standard dataset metadata such as provider, parameters, platform and time period, together with data lineage and data quality information. The additional value of DataSpaces comes from the context provided by the dataset community: users, mediators and providers. This may be through links to other mediator or user-provided metadata, publications that reference the dataset, or web applications and tools using the dataset. DataSpaces also provide a place where a dataset community can connect through discussion and announcements about the dataset. As DataSpaces evolve and are used more by the community, additional functionality will emerge. Currently, there are still many open issues with the implementation of DataSpaces, including how to link the DataSpace to the dataset as it moves along the usage chain and how material in DataSpaces can be reused in other metadata.<br />
The ADC and UIC are both participating stakeholders in the functioning of the GEOSS information system that overcomes these hurdles. The UIC is in position to formulate questions and the ADC can provide infrastructure that delivers the answers. <br>
 
  
[[Image:20090505 AIP2 ADC UICSlide6.PNG|400px]]<br>
+
<br /></div>
The data reuse is possible through the service oriented architecture of GEOSS.
 
  
* Service providers register services in the GEOSS Clearinghouse.
+
== Connections to other Activities ==
* Users discover the needed service and access the data
 
  
The result is a dynamic binding mechanism for the construction of loosely-coupled work-flow applications.<br>
+
uFIND is an open system and its functionality depends on inputs from AQ users, data providers and value-adding contributors. Most connections of the uFIND system can be executed through formal service interfaces. This open architecture will be the key mechanism for the scalable growth and evolution of the uFIND system.<br /><br /> For example, within ESIP, we anticipate symbiotic interaction with the Semantic Web Cluster. uFIND, with its thousands of data layers, can provide a rich resource and be a use case for ontology development and testing. In return, the developed ontologies and catalog services will help uFIND improve the user experience. Similarly, it is anticipated that collaboration with various ESIP groups pertaining to data provenance will be pursued and the results incorporated into the uFIND metadata. This will include a provenance chaining service.<br /><br /> Collaborations with other activities will build upon previous and ongoing collaborations that have developed as part of the project team's earlier projects. Through recent community building exercises, such as the Federation of Earth Science Information Partners (ESIP) Air Quality Workgroup and the GEOSS Architecture Implementation Pilot, the collaborative participants needed for a successful Air Quality uFIND have already been brought together, and through this project they will be focused on particular capabilities and needs.<br /><br /> Participation in ESDSWGs is expected to include the Technology Infusion working group. Rudolf Husar and Stefan Falke will contribute time, totaling 0.25 FTE, to these working groups. During previous NASA REASoN and AIST projects, Stefan Falke and Rudy Husar contributed to the tech infusion, sensor web and reuse working groups. 
Based on previous experience, we anticipate that this ACCESS project would provide the most suitable contributions to the Technology Infusion Working Group, but we will work with NASA to scope our participation.<br /><br /> It is worth noting that a similar requirement in an earlier NASA REASoN CAN compelled us to become involved in the Earth Science Information Partners (ESIP) and to create and coordinate the ESIP Air Quality Workgroup, which has been an ongoing collaboration forum for over four years, with particularly active and broad participation over the last two years. We anticipate that this ACCESS proposal will help continue the coordination and growth of the ESIP Air Quality Workgroup. The ESIP Air Quality Workgroup is a crucial partner in the proposed effort. In fact, we initially submitted this proposal's Notice of Intent with ESIP listed as the PI. However, it was determined that ESIP serving as PI introduced some conflicts with ESIP policies and by-laws, and ESIP is now represented in the project as a coordination and collaboration environment in which to develop, test, and refine uFIND.<br /><br />
  
[[Image:20090505 AIP2 ADC UICSlide7.PNG|400px]]<br>
+
<div class="MsoNormal">The AQ uFIND will be registered as a component in the GEOSS Component and Service Registry (CSR). The GEOSS Clearinghouses query the GEOSS CSR for catalogs and then periodically harvest the catalogs for their metadata records. From the point of view of the GEOSS Common Infrastructure, this particular functionality makes uFIND an Air Quality community catalog. </div>
The primary purpose of the metadata is to facilitate finding and accessing the data, in order to help users deal with the first two hurdles they face. Clearly, the air quality specific metadata, such as sampling platform, data domain and measured parameters, need to be defined by air quality users. Dealing with the hurdles of data quality and multi-sensory data integration is a topic for future efforts. <br>
 
  
 +
=== Extensions of Past Work and Impact of this Work ===
  
The finding of air quality data is accomplished in two stages.
+
The proposed work is an extension of past research efforts conducted since 2001 on the development of the federated data system, DataFed (DataFedwiki; Husar, 2007; Husar, 2006; Husar, 2005). The DataFed development was supported by NSF, EPA as well as through the 5-year NASA REASoN grant, 2004-2009, "Application of ESE DATA and Tools to Particulate Air Quality Management." Over 100 standards-based datasets for air pollution data mediated through DataFed have been accessed by over a thousand repeat users from throughout the world (Fig.5).<br />[[Image:DataFed_Analytics.png|400px]]<br>
* the data are filtered through the generic discovery mechanism of the clearinghouse
+
<font size="1"> Figure 5. Location of top 100 users from Jan 1, 2007-June 20, 2009. Captured with Google Analytics</font><br /><br /> DataFed was also the data access system for the development and subsequent application of EPA's Exceptional Event Rule. The DataFed services have provided the key data source for many interoperability experiments for data processing and other workflow applications conducted through ESIP and GEOSS. Most recently, DataFed was a key contributor as part of the GEOSS Architecture Implementation Pilot (AIP)-II discussed in the Workplan section of this proposal.<font size="3"> </font>
* then air quality specific filters such as sampling platform and data structure are applied
 
  
[[Image:20090505 AIP2 ADC UICSlide9.PNG|400px]]<br>
+
=== Relevance to NASA Programs ===
Once the data are accessible through standard service protocols and discoverable through the clearinghouse they can be incorporated and browsed in any application including the ESRI and Compusult  GEO Portals. <br>
 
  
[[Image:20090505 AIP2 ADC UICSlide10.PNG|400px]]<br>
+
<div></div><div>In meeting its goal to "Study Earth from space to advance scientific understanding and meet societal needs", the 2006 NASA Strategic Plan emphasizes that "as new types of Earth observations become available, information systems, modeling, and partnerships to enable full use of the data for scientific research and timely decision support will become increasingly important." The proposed Air Quality uFIND is a tool to enhance the ability of NASA and its data users to use existing and new data products and to develop partnerships that increase the usefulness of NASA data products in air quality decision support. </div><div></div><div>The Decadal Survey published by the National Academies of Science outlines objectives for future satellite earth observation missions for NASA and NOAA. It also stresses the importance of adequate information systems for making use of those new observations by concluding that "fundamental improvements are needed in existing observation and information systems because they only loosely connect three key elements: (1) the raw observations that produce information; (2) the analyses, forecasts, and models that provide timely and coherent syntheses of otherwise disparate information; and (3) the decision processes that use those analyses and forecasts to produce actions with direct societal benefits." The proposed project helps couple these three elements more closely by making it easier for earth observation users who create analyses and forecasts to find and access earth observations, thereby improving their ability to support decision processes.<br /><br /> The Air Quality uFIND is a demonstration of a modern, service-oriented architecture and of the application of a small set of interoperability technologies that allows the creation of distributed but interoperable data systems. 
While the particular application is for the domain of air pollution and atmospheric composition, the architecture, the technologies and the tools and methods are directly applicable to other science domains of interest to NASA, such as climate change.<br /><br /> The Air Quality uFIND constitutes a conduit through which datasets registered in GCMD and other data directories can be delivered more directly and more easily to the air quality science and management community. Furthermore, the direct, two-way network link between uFIND and NASA data portals such as Giovanni constitutes a convincing demonstration of the viability of service-oriented data networking. </div>
The registered datasets are also directly accessible to air quality specific, work-flow based clients which can perform value-adding data processing and analysis.
 
  
The loose coupling between the growing data pool in GEOSS and workflow-based air quality client software shows the benefits of the Service Oriented Architecture to the Air Quality and Health Societal Benefit Area.  <br>  
+
<font size="3"> </font>
[[Image:20090505 AIP2 ADC UICSlide11.PNG|400px]]<br>
 
  
==== The Network ====
+
=== Relevance to NRA Objectives ===
  
* Fan-In, Fan-Out
+
The uFIND infrastructure and the associated tools and methods constitute a direct contribution to the ACCESS objective: to develop a "means for users to discover and use services being made available by NASA, other Federal agencies, academia, the private sector and others." Assessing the total life-cycle cost of this development is difficult since its development has been pursued for about a decade and the process of service oriented data sharing will continue the development well beyond this two-year project. However, this is an important phase in the evolution of data networking because the applications have reached levels of TRL 7 or higher. This means that data from distributed providers can now be reliably and persistently found and accessed. In fact, we anticipate that at the end of the project, uFIND will incorporate several thousand air quality data layers that can be queried by sharp filters and immediately incorporated into browsers and processing applications.<br /><br /> The design of uFIND is initially targeted at the AQ-relevant datasets from NASA and the over one hundred datasets already registered in DataFed as services. However, discussions are in progress for the implementation of a uFIND node as part of EPA's Exchange Network. Conceivably, additional uFIND nodes could be established at NOAA as well as at international agencies in Europe and Asia. These additional future activities are anticipated to be conducted in the architectural framework of GEOSS, and uFIND nodes would constitute the components of a distributed Air Quality Community Catalog. Hence the life-cycle cost of the system will be the combined investment in system development and maintenance, integrated over the time period of its evolution as well as over the multiple contributions by the community of its participants.<br /><br /> uFIND is not appropriate for all datasets. 
The design of uFIND inherently targets datasets that are expected to have the quality and value such that multiple applications can benefit from their use. Creating extensive metadata and providing a standard data access interface may be inappropriate for datasets that are of limited applicability for re-use. uFIND is also inappropriate for complicated datasets such as aircraft sampling data. Furthermore, uFIND is not suitable for the delivery of raw sampling calibration datasets.<br /><br /><font size="3"> </font>
* (so is GCI) not central
 
* holarchy, data up into the pool through the aggregator network and down the disaggregator/filter network
 
[[Image:ScaleFreeNetwork3.png|300px]]
 
* Data distributed through Scale-free aggregation network. Metadata contributed along the line of usage. Homogenized and shared.
 
  
 +
== General Work Plan ==
  
 +
<div><font size="2">The basic workplan for the proposed 2-year project consists of the following activities:<br /></font></div><div>'''Year 1:''' Focus on establishing connections to uFIND system components and their interactions. Increased connections to data providers. </div>
  
This proposal...application of the GEOSS concepts in the federated data system, DataFed. The proposal focuses on the SOA aspects of the publish find bind. ...a contribution to the emerging architecture of GEOSS. It is recognized that it represents just one of the many configurations that are consistent with the loosely defined concept of GEOSS.
+
* <font size="2">Survey of data directories and data hubs for relevant AQ data</font>
 +
* <font size="2">Develop community process for selecting air quality datasets</font>
 +
* <font size="3"><div><font size="2">Continue the testing and improvement of WCS-netCDF data wrapper </font></div></font>
 +
* <font size="3"><div><font size="2">Facilitating WCS interface to selected datasets </font></div></font>
 +
* <font size="3"><div><font size="2">Continue to develop the metadata mapping tool for easy creation of AQ metadata records</font></div></font>
 +
* <font size="3"><div><font size="2">Facilitate registration of AQ datasets in uFIND</font></div></font>
 +
* <font size="3"><div><font size="2">Implementation of DataSpaces for key datasets </font></div></font>
  
 +
<font size="3"> </font>
  
The implementation details and the various applications of DataFed are reported elsewhere [4]-[6].
+
<font size="2">'''Outcome of year 1:''' AQ-relevant datasets that can be found through sharp queries of the metadata and accessed through OGC Standard Data Access Services<br /></font>
  
 +
<div></div><div><font size="2">'''Year 2'''</font><font size="2">'''<nowiki>:</nowiki>''' Focus on testing and refining uFIND</font></div>
  
 +
* <div><font size="3"><font size="2">Users test the data access system; based on their feedback, flaws in the access services are eliminated.</font></font></div>
  
-----
+
* <div><font size="3"><font size="2">Users test uFIND for usability; flaws in metadata or connectivity are corrected</font></font></div>
 +
* <div><font size="3"><font size="2">Connection and testing with uFIND and DataSpaces for compatibility and utility</font></font></div>
  
Data Value Chain Stages: Acquisition - Mediation - Application
+
<div>'''Outcome of year 2:''' a well-tested and robust uFIND consisting of thousands of AQ-relevant data layers </div><div></div><div>It is anticipated that throughout this project collaboration would occur through the GEOSS ADC AIP-III, ESIP and EPA's Exchange Network.</div>
* '''Acquisition:''' Data from Sensor -> CalVal -> Data exposed
 
* '''Mediation:'''  Accessible/Reusable -> Leverable
 
* '''Application:''' Processed -> LeveragedSynergy -> Productivity
 
  
=== Provider and User Oriented Designs ===
+
==== Current State of Application ====
* Providers offer their wares ... to reach maximum users in many applications
 
*
 
[[Image:GEOSS_Fanin_Fanout.png |600px]] <br>
 
[[Image:GEOSSUIC_Diagram.png |400px]] <br>
 
[[Image:PublishFindBind.png | 600px]] <br>
 
[[Image:ExtendedArch.png]]
 
  
==Community==
+
<div>The initial demonstration of uFIND was prepared as part of the GEOSS Architecture Implementation Pilot (AIP-II) in 2008-2009. During that pilot, the components shown in Figure 4 were functional and connected to the GEOSS Clearinghouse (GEOSS Clearinghouse FGDC, 2009). In fact, uFIND served as the Air Quality Community Catalog that was routinely harvested by the clearinghouse. The metadata for the air quality data records has been developed, such that faceted search on about 500 data records could be executed. During AIP-II the data access services were confined to OGC WMS images. WCS data services from providers other than DataFed were not available.<br /></div>
  
+
<br />
To illustrate the Network, Coding (faceting) through metadata, WMS to show data
 
** WCS next
 
* Binding to data through standard data access protocols, publishing and finding requires metadata system
 
* google analytic sensors - where do we put it? so that we can identify users pattern.
 
  
==== Community ====
+
As part of this project, the uFIND pilot implementation will be expanded in a number of ways such that at the end of the project uFIND will be a complete and robust package of services and tools, with graphical user interfaces. The specific developments in year 1 will include the connection to data directories and user-driven selection of several thousand data layers offered by distributed providers. A major effort will be invested in helping data providers to expose their data through the WCS standard interface. The WCS-netCDF data wrapper will undergo considerable testing and further development of functionality and usability. The metadata preparation tools will be expanded to accommodate additional facets for searching and also links to additional metadata such as data provenance. The uFIND search engine will be expanded for a broader range of facets and also for alternative user interfaces. A substantial development will consist of linking uFIND data records to data browsers, such as Google Earth for spatial views and time charts for temporal data views. The interfaces of uFIND will also be streamlined to improve the incorporation of uFIND outputs into other client applications such as portals.<br />
  
=== Data Discovery ===
+
==== Management Approach ====
* [http://sesdi.hao.ucar.edu/docs/IN24A-05_Fox_SESDI_usecase.pdf Semantic Mediation] - Repackaging/homogenizing metadata - added value comes from incorporating user actions back into the semantic relationships.
 
* Metadata system for publishing and finding content has to be jointly developed between data providers and users. 
 
* Generic catalog systems  - metadata collection of not only what provider has done but also tracking what users need
 
** Collecting and Enhancing Metadata from observing Users
 
* Communication along the value chain, in both direction;
 
* Metadata the glue and the message
 
* Market approach; many providers; many users; many products
 
  
* Faceted search
+
The major part of the proposed project will be performed by the core group of investigators listed in the table below. Professor Husar is the Principal Investigator and the architect of the federated data system, DataFed. He has a long-term interest in environmental informatics with particular emphasis on tools for data access and exploration. Professor Falke is the Co-Investigator and an expert in geospatial informatics. He has also been a community leader for the ESIP Air Quality Workgroup and other community activities. Erin Robinson is a doctoral candidate in the School of Engineering and has years of experience in environmental data analysis and in tools and technologies to assist collaborative research. Participation in this project will constitute a significant contribution to her Ph.D. thesis. Kari Hoijarvi is the chief software designer and programmer of the major software systems developed at CAPITA since the 1990s. His deep understanding of Service Oriented Programming and data processing will be key to the development of uFIND. Ed Fialkowski is a software developer with experience in networking, portal development and other applications development.<br /><br />
** user is happy
 
* Search by usage data
 
  
Description on how users will discover and use services provided by NASA, other Agencies, academia..
+
<div><div>
* Detail on discovery services
 
* System components for persistent availability of these services
 
** machine-to-machine interface
 
** GUI interface
 
  
Classes of Users
+
{| id="wy9o" width="648" border="1" cellpadding="3"
* by value chain
+
| width="33%" | '''Name '''
* by level of experience
+
| width="33%" | '''Organization '''
* by ...
+
| width="33%" | '''Role/Contribution'''
 +
|-
 +
| width="33%" | Rudolf Husar
 +
| width="33%" | Washington Univ.
 +
| width="33%" | PI, uFIND Architect
 +
|-
 +
| width="33%" | Stefan Falke
 +
| width="33%" | Washington Univ.
 +
| width="33%" | Co-I, ESDWG, DataSpaces Connection
 +
|-
 +
| width="33%" | Erin Robinson
 +
| width="33%" | Washington Univ.
 +
| width="33%" | Graduate Student, uFIND Metadata
 +
|-
 +
| width="33%" | Kari Hoijarvi
 +
| width="33%" | Consultant
 +
| width="33%" | Finder Developer, WCS Server Developer
 +
|-
 +
| width="33%" | Ed Fialkowski
 +
| width="33%" | Washington Univ.
 +
| width="33%" | Metadata Tools Developer
 +
|}
  
=== Data Assess and Usage ===
+
</div></div><div></div><div>The proposed project will rely heavily on the community contributions of the ESIP Air Quality Workgroup, the GEOSS Architecture Implementation Pilot Community and other communities such as the group associated with EPA's Air Quality Data Summit.<br /><br /> As in the past projects, this work will be performed as an open process with active participation of interested members from these communities. The community support will include design guidance, technology contributions and most importantly in establishment of service connections between uFIND and upstream and downstream services.<br /></div>
Provider Oriented Catalog:
 
  
* All data from a provider (subset fall  data)
+
<font size="3"> </font>
* Metadata only (Standard protocol)
 
* Provider metadata (meta-meta-meta  data)
 
  
 +
== Data Sharing Plan<br /> ==
  
User Oriented Catalog:
+
<div></div><div>The entire basis of the Air Quality uFIND is data sharing, so the work plan outlined above is, in essence, our data sharing plan. However, we highlight certain data sharing aspects of the proposed approach here including those aspects that help other projects and programs enhance their data sharing capabilities. The data reuse will be enhanced through the service oriented architecture. The registered datasets are also directly accessible to air quality specific, work-flow based clients which can perform value-adding data processing and analysis.<br /><br /> Sharing research results with communities is vital to the continued development of uFIND and the infrastructures in which it is used. We will expose our results at ESTO technology conferences, AGU meetings and other forums conducive to new interactions with data providers and users. We will gain additional interaction with the earth science community through our continued leadership in the Earth Science Information Partners Federation (ESIP) and through GEO-related activities. </div>
  
* The right data to the user at the right time the right  (subset fall data)
+
<font size="3"> </font>
* Seamlessly accessible (Standard protocol)
 
* Complete Metadata (meta-meta-meta  data)
 
  
 +
== Operations Concept ==
  
* Has to handle derived data (Raw-processed Pyramid --- less along the value chain-network)
+
In contrast to a stand-alone project that relies on its own individual initiatives to achieve persistence or sustainability, the Air Quality uFIND is expected to be closely coupled with other, larger efforts that provide opportunities for continued operation by various domain communities. For example, GEOSS is gradually being developed by assembling and developing components of the overall system and bringing them together. While a long-range operations plan is yet to be defined, the expectation is that successful development will lead to a sustainable infrastructure. The Air Quality uFIND is aimed at providing a critical component for the use of GEOSS by the air quality community and, as a result, uFIND will become part of the GEOSS operations plan as that plan is developed. A suggested approach for the continued operation of uFIND is through collaborations and resource sharing. Operations would be co-supported by other organizations that adapt it, such as EPA's Exchange Network in the U.S.
  
== Performance Measurement and Feedback==
+
<div></div>
=== Metadata from Providers ===
+
----
Active contributions:
 
* Provide discovery, access information
 
* Providers can also provide information about how users behave once they are at their site
 
  
Analytics contributions:
+
= <font size="4">References</font> =
* Providers could expose monitoring data about usage on their site in order to provide information about who uses the data, where, when...
 
* Mediators could aggregate monitoring data from multiple providers of the same data
 
  
=== Metadata from Users ===
+
<div><font size="2"><div style="margin-left: 0.5in">
Analytics:
 
* Google Analytics/Google Sitemap - provides feedback and helps market process by improving "shopping" experience for users - creates values to both users and producers.
 
* Amazon - collects data on user actions in order to help the next user navigate to books of interest. collects data on text in the book in order to relate books together
 
* [http://wiki.earlyimpact.com/widgets/recently_viewed_products_widget Recently Viewed widget] Tracks for the user the last things that they viewed
 
* The 'What users do next' would allow a capture of
 
** other datasets viewed at the same time
 
** tools that are used with this dataset - based on the analytics  monitoring tool/data combinations you could provide information on "this data is most commonly used with this tool"
 
  
 +
<font size="1">Air Quality and Health Working Group (GeossPilot2). Available at: http://sites.google.com/site/geosspilot2/air-quality-and-health-working-group [Accessed June 26, 2009].</font>
  
User active contributions <br>  
+
<font size="1">Air Quality Work Group - Federation of Earth Science Information Partners. Available at: http://wiki.esipfed.org/index.php/Air_Quality_Work_Group [Accessed June 26, 2009]. </font>
Tags:
 
* Users can tag based on which project the data is used in, event, ...
 
* Users could login and select favorites or tag in their own way this would allow navigation of data by users
 
** The benefit gained by logging in, is that the catalog would be personalized
 
** The benefit others gain is an additional relationship of datasets that only a human could know.
 
** Logging in would also allow the identification of 'data experts' - who are people that have contributed a lot about this dataset
 
* Dataset popularity could be shown like in delicious - x# of people have tagged this.  
 
* Feeds of particular tag query can be fed to different sources
 
* This extends the "what links here" that could be embedded in the dataspace page. (youtube and sitemap both track what links here)
 
* [http://amapedia.amazon.com/ Amapedia] - Amazon "DataSpaces" - lets users give structured tags "facts" that allow additional navigation in novel ways.
 
Reviews
 
* Users can offer "reviews" of the data - feedback on problems, advertise where they use it, questions about the data
 
* Users can offer help docs, papers or other information they have on the data (also adding to the expertise of a particular contributor)
 
  
=== Metadata from Mediators ===
+
<font size="1">Cleveland, Harland, Information as a Resource, Futurist, 16, 34-39, 1982. </font>
* [http://docs.amazonwebservices.com/AlexaWebInfoService/2005-07-11/ Alexa Amazon Web Information Service]provides analytics about another site. This is good for us b/c we could access analytics about distributed data access and show them uniformly in catalog (Site Overview pages, Traffic Detail pages and Related Links pages)
 
** Alexa also example of site pulling in information about one URL from multiple sources. ([http://www.alexa.com/siteinfo/amazon.com Amazon example])
 
*  perform collaborative filtering based upon data collected from more than one [data provider]
 
* [http://www.kaushik.net/avinash/ Avinash Kaushik] - Evangelist for Google Analytics
 
* Key Performance Indicators -
 
** Where do people come from? search, other links - % of visits across all traffic source
 
** Bounce rate - Came to one page and left immediately  (Good for us b/c it helps direct people to the right information/right place/right time)
 
*** Combining where people come from and how many people from that source bounce lets you know if you are targeting the right people...
 
*** Wrong audience
 
*** Wrong Landing page
 
** Visitor loyalty - # that come in a given duration
 
** Recency of visit - do you retain people over time
 
* What are the key outcomes you want people to do (i.e. subscribe to feed, click on data link, click on metadata link...)
 
* What is the top content on site?
 
  
==Management==
+
<font size="1">Data Summit Workspace - Federation of Earth Science Information Partners. Available at: http://wiki.esipfed.org/index.php/Data_Summit_Workspace [Accessed June 26, 2009]. </font>
Community approach:
 
* ESIP AQ Workgroup with links to
 
** GEO CoP -> Linking multi-region (global), multi SBA
 
* Agency (Air Quality Information Partnership (AQIP))
 
 
 
** EPA
 
** NASA
 
** NOAA
 
** DOE ....
 
* ESIP
 
** Semantic Cluster
 
*** Offer: a rich highly textured data needing semantics
 
*** Needed:  Semantics of the data descriptions and finding
 
  
** Web Services and Orchestration  Cluster
+
<font size="1">Evidence for Flagging Exceptional Events - Federation of Earth Science Information Partners. Available at: http://wiki.esipfed.org/index.php/Evidence_for_Flagging_Exceptional_Events [Accessed June 26, 2009]. </font>
*** Offer: A rich array of WCS data access services
 
*** Different workflow & orchestration clients
 
  
** Meetings
+
<font size="1">GEO Task US-09-01a Home Page. Available at: http://sbageotask.larc.nasa.gov/ [Accessed June 26, 2009]. </font>
*** Winter
 
*** Summer
 
** Telecons
 
***
 
  
** OGC WCS netCDF
+
<font size="1">GEO User Requirements for Air Quality - Federation of Earth Science Information Partners. Available at: http://wiki.esipfed.org/index.php/GEO_User_Requirements_for_Air_Quality [Accessed June 26, 2009]. </font>
*** Stefano
 
*** Ben
 
*** Max Cugliano
 
  
* Other Related Proposals/Projects
+
<font size="1">GEOSS-CLEARINGHOUSE: Common Search Facility. Available at: </font>[http://clearinghouse.awcubed.com/perl-bin/ch_query <font size="1">http://clearinghouse.awcubed.com/perl-bin/ch_query</font>]<font size="1"> [Accessed June 26, 2009].</font>
** Show our CC proposal to AQWG
 
*** Ask them if they have a way to use this CC as testbed
 
*** Add a paragraph into the proposal to indicate the way their fits in
 
  
== links ==
+
<font size="1">Husar, R., Falke and K. Hoijarvi: Interoperability of Web Service-Based Data Access and Processing: Experience Using the DataFed System. ESTO Meeting, 2006. ''Paper A6P2''. <span class="Z3988" title="url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&rft_val_fmt=info2Ffmt3Amtx%3Ajournal&rft.genre=article&rft.atitle=Falke20K.3A20of20Service-Based20Access20Processing20Experience20the20System.20Meeting202006&rft.jtitle=Paper%20A6P2&rft.aufirst=R.&rft.aulast=Husar&rft.au=R.%20Husar"> </span> </font>
ESIP
 
  
* [http://wiki.esipfed.org/index.php/Data_System_Governance_and_Sustainability Data_System_Governance_and_Sustainability and Synergy]
+
<font size="1">Husar, R. & Poirot, R., 2005. DataFed and FASTNET: Tools for agile air quality analysis. ''EM-PITTSBURGH-AIR AND WASTE MANAGEMENT ASSOCIATION-'', 39. <span class="Z3988" title="url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&rft_val_fmt=info2Ffmt3Amtx%3Ajournal&rft.genre=article&rft.atitle=DataFed20FASTNET20Tools20agile20quality%20analysis&rft.jtitle=EM-PITTSBURGH-AIR20WASTE20ASSOCIATION-&rft.aufirst=R.&rft.aulast=Husar&rft.au=R.%20Husar&rft.au=R.%20Poirot&rft.date=2005&rft.pages=39"> </span> </font>
* [http://wiki.esipfed.org/index.php/Interoperability_of_Air_Quality_Data_Systems Interoperability_of_Air_Quality_Data_Systems]
 
* [http://wiki.esipfed.org/index.php/Data_Systems_Architecture Data_Systems_Architecture]
 
* [http://wiki.esipfed.org/index.php/Desired_Characteristics_of_Air_Quality_Data_Systems Desired_Characteristics_of_Air_Quality_Data_Systems]
 
* [http://wiki.esipfed.org/index.php/Air_Quality_Data_Processing_and_Portal_Systems Air_Quality_Data_Processing_and_Portal_Systems]
 
* [http://wiki.esipfed.org/index.php/Air_Quality_Data_Providers Air_Quality_Data_Providers]
 
* [http://wiki.esipfed.org/index.php/AQ_Data_System_Clients_and_User_Groups AQ_Data_System_Applications Clients_and_User_Groups]
 
* [http://wiki.esipfed.org/images/9/95/Community_Air_Quality_Data_Systems_Strategy.doc Community_Air_Quality_Data_Systems_Strategy]
 
* [http://wiki.esipfed.org/index.php/AQ_Data_System_Strategy_Introduction_and_Purpose AQ_Data_System_Strategy_Introduction_and_Purpose]
 
  
 +
<font size="1">Husar, R.B. & Hoijarvi, K., 2007. DataFed: Mediated web services for distributed air quality data access and processing. In ''IEEE International Geoscience and Remote Sensing Symposium, 2007. IGARSS 2007''. pp. 4016-4020. </font>
  
DataFed wiki
+
<font size="1">Husar, R.B. et al., 2008. DataFed: An Architecture for Federating Atmospheric Data for GEOSS. ''IEEE Systems Journal'', 2(3), 366-373. <span class="Z3988" title="url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&rft_val_fmt=info2Ffmt3Amtx%3Ajournal&rft.genre=article&rft.atitle=DataFed20An20for20Atmospheric20for%20GEOSS&rft.jtitle=IEEE20Journal&rft.volume=2&rft.issue=3&rft.aufirst=R.%20B.&rft.aulast=Husar&rft.au=R.20Husar&rft.au=K.%20Hoijarvi&rft.au=S.20Falke&rft.au=E.20Robinson&rft.au=G.20Percivall&rft.date=2008&rft.pages=366-373"> </span> </font>
  
* [http://datafedwiki.wustl.edu/index.php/2006-01-11_Data_Flow_&_Interoperability_in_DataFed_Service-based_AQ_Analysis_System Interoperability_in_DataFed_Service-based_AQ_Analysis_System]
+
<font size="1">INSPIRE geoportal. Available at: </font>[http://www.inspire-geoportal.eu/ <font size="1">http://www.inspire-geoportal.eu/</font>]<font size="1"> [Accessed June 26, 2009].<br /></font>
* [http://datafedwiki.wustl.edu/index.php/2008-07-07_EPA_Interoperability 2008-07-07_EPA_Interoperability]
 
* [http://datafedwiki.wustl.edu/images/4/44/Husar060126_HTAP_Geneva.ppt HTAP Interoperability Husar060126_HTAP_Geneva]
 
* [http://datafedwiki.wustl.edu/index.php/2008-02-11:_EPA_DataFed_Presentation 2008-02-11:_EPA_DataFed_Presentation]
 
* [http://datafedwiki.wustl.edu/index.php/Collaboration DataFed Collaboration Projects]
 
* [http://datafedwiki.wustl.edu/index.php/2008-02-29:_Data_&_Knowledge_Differences_on_Interop_Stack 2008-02-29:_Data_&_Knowledge_Differences_on_Interop_Stack]
 
* [http://datafedwiki.wustl.edu/index.php/2006-06-28_ESTO,_College_Park 2006-06-28_ESTO,_College_Park]
 
  
NASA Existing Component - Links
+
<font size="1">Research, P.O.I.T.A.T.C.O. et al., 1989. ''Information Technology and the Conduct of Research: The User's View'', National Academy Press.</font>
* [http://nasadaacs.eos.nasa.gov/pdf/EOSDIS_Atmosphere_Dec2008.pdf Atmosphere Data Reference Sheet] - Datasets identified to be relevant to atmospheric research.
+
 
* [http://gdata1.sci.gsfc.nasa.gov/daac-bin/G3/gui.cgi?instance_id=atrain Giovanni] -  that provides a simple and intuitive way to visualize, analyze, and access vast amounts of Earth science remote sensing data without having to download the data [http://disc.sci.gsfc.nasa.gov/giovanni/additional/users-manual/G3_manual_parameter_appendix.shtml#AOD55 GIOVANNI metadata] describes briefly parameter
+
<font size="1">Robinson, E.M. & Husar, R.B., 2008. DataSpaces: Using Community Workspaces to Enable Rich Air Quality Metadata. In ''American Geophysical Union, Fall Meeting 2008, abstract# IN22A-06''.</font>
* [http://disc.sci.gsfc.nasa.gov/PIP/ parameter information pages] - provide short descriptions of important geophysical parameters; information about the satellites and sensors which acquire data relevant to these parameters; links to GES DAAC datasets which contain these parameters; and external data source links where data or information relevant to these parameters can be found.
+
 
* [http://mirador.gsfc.nasa.gov/cgi-bin/mirador/help.pl?helppage=overview.shtml&helpmenuclass=overview&SearchButton=Search%20GES-DISC Mirador] new search and order Web interface employs the Google mini appliance for metadata keyword searches. Other features include quick response, data file hit estimator, Gazetteer (geographic search by feature name capability), event search
+
<font size="1">Taylor, R.S. & Voigt, M.J., 1986. ''Value added processes in information systems'', Greenwood Publishing Group Inc. Westport, CT, USA. <span class="Z3988" title="url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&rft_val_fmt=info2Ffmt3Amtx%3Abook&rft.genre=book&rft.btitle=Value20processes20information%20systems&rft.publisher=Greenwood20Group20Westport20CT20USA&rft.aufirst=R.%20S.&rft.aulast=Taylor&rft.au=R.20Taylor&rft.au=M.20Voigt&rft.date=1986"> </span> </font>
* Frosty
+
 
* [http://gcmd.nasa.gov/KeywordSearch/Metadata.do?Portal=GCMD&KeywordPath=Parameters|ATMOSPHERE|AIR+QUALITY|SULFUR+OXIDES&OrigMetadataNode=GCMD&EntryId=GES_DISC_OMSO2G_V003&MetadataView=Full&MetadataType=0&lbnode=mdlb2 GCMD] - Has selected Air Quality datasets; provides lots of discovery metadata keywords, citation, etc. lacks standard data access.  
+
<font size="1">uFIND Pilot. Available at: http://webapps.datafed.net/geoss_catalog.aspx [Accessed June 26, 2009].</font>
* [http://disc.sci.gsfc.nasa.gov/atdd A-Train Data Depot] -  to process, archive, allow access to, visualize, analyze and correlate distributed atmospheric measurements from A-Train instruments.
+
 
* [http://disc.sci.gsfc.nasa.gov/acdisc/overview.shtml Atmospheric Composition Data and Information Service Center] -  is a portal to the Atmospheric Composition (AC) specific, user driven, multi-sensor, on-line, easy access archive and distribution system employing data analysis and visualization, data mining, and other user requested techniques for the better science data usage.
+
<div>
* [https://wist.echo.nasa.gov/api/ WIST] -  Warehouse Inventory Search Tool. search-and-order tool is the primary access point to 2,100 EOSDIS and other Earth science data sets
+
----
* [http://mercury.ornl.gov/esip/ FIND] - The FIND Web-based system enables users to locate data and information held by members of the Federation (DAACs are Type 1 ESIPs.) FIND incorporates EOSDIS data available from the DAAC Alliance data centers as well as data from other Federation members, including government agencies, universities, nonprofit organizations, and businesses.
+
</div>
* [http://sesdi.hao.ucar.edu/intro.php SESDI Semantically-Enabled Science Data Integration] | [http://sesdi.hao.ucar.edu/vision.php Vision] - ACCESS Project, Peter Fox - will demonstrate how ontologies implemented within existing distributed technology frameworks will provide essential, re-useable, and robust, support for an evolution to science measurement processing systems (or frameworks) as well as for data and information systems (or framework) support for NASA Science Focus Areas and Applications.
 

Latest revision as of 12:01, June 29, 2009

uFIND: User-oriented Tool Set for Air Quality Data Discovery and Access

The objective of this work is to connect air quality users to air quality-relevant data services at NASA and elsewhere through the implementation of a set of well-tested and effective services, tools and methods. The centerpiece of the proposed project is the web service-based tool set: user-oriented Filtering and Identification of Networked Data (uFIND). The purpose of uFIND is to provide rich and powerful facilities for the user to: discover and choose a desired dataset by navigating the multi-dimensional metadata space using faceted search; seamlessly access and browse datasets; and use uFIND's facilities as a web service for mashups with other AQ applications and portals. Datasets found through uFIND will be accessible through the standard OGC WCS and WMS data access protocols. A major perceived contribution of the proposed work is an infrastructure for the harvesting, harmonization and continuous augmentation of metadata through the metadata flow system. The metadata will follow the ISO 19115 standard for geospatial metadata. The Service Oriented Architecture of uFIND will allow service-based interfacing with providers and users of the metadata. uFIND will be applicable to other Earth Science domains and be instrumental in the emerging Global Earth Observation System of Systems (GEOSS), in particular in the development of the GEOSS Common Infrastructure (GCI).


Scientific/Technical/Management Section

Objectives

This proposal is in response to the solicitation: Advancing Collaborative Connections for Earth System Science (ACCESS) 2009. In particular, it focuses on providing "means for users to discover and use services being made available by NASA, other Federal agencies, academia, the private sector, and others". The objective of this work is to implement a set of well-tested and effective services, tools and methods for the discovery and seamless access to air quality-related datasets.

Background

Recent developments offer outstanding opportunities to fulfill the information needs of the Earth Sciences and to support many societal benefit areas. The satellite sensing revolution of the 1990s now yields near-real-time observations of many atmospheric parameters. The data from surface-based monitoring networks now routinely provide detailed characterization of atmospheric and surface parameters. The "terabytes" of data from these surface and remote sensors can now be stored, processed and delivered in near-real time, and the instantaneous "horizontal" diffusion of information via the Internet now permits, in principle, the delivery of the right information to the right people at the right place and time. Nevertheless, atmospheric scientists and air quality decision makers face significant hurdles. The production of atmospheric observations and models is rapidly outpacing the rate at which these observations are assimilated and metabolized into actionable knowledge that can produce scientific understanding and societal benefits. The "data deluge" problem is especially acute for atmospheric scientists interested in the use of satellite observations. As a consequence, most atmospheric observations are significantly under-utilized.
In the U.S., virtually all the important Earth Science datasets, including air quality-relevant observations, are now publicly accessible through the Internet. Typically, data providers place the numeric datasets onto a web server along with the associated 'Readme' files, which contain descriptive information about the data, access instructions and other metadata. Over the past decades, there was also a steady evolution of data directories where individual datasets could be registered, labeled and searched by potential users. The outstanding example of a long-term cataloging effort is NASA's Global Change Master Directory (GCMD), which also illustrates the data deluge problem. A search of GCMD for "atmosphere" shows about 8000 entries for datasets and 1000 entries for services. Even the air quality parameter "aerosol" returns about 1000 datasets and 500 services. The directories of other agencies and nations probably contain at least this many entries.

In GCMD, user support for finding a dataset is provided primarily through a controlled vocabulary of keywords along with free text search of the DIF-formatted standard metadata. Providers of Earth Observations can easily publish their data in GCMD, and they only have to do it once. On the other hand, finding, accessing and using any given dataset takes a much larger effort, and that effort is repeated individually by every user of the dataset. Collectively, the users of a dataset may expend 100 or 1000 times more effort than its publisher. As pointed out by Cleveland (1982), in an information-rich environment most of the burden of accessing the proper information is carried by the user, not the producer. According to Taylor (1986), this type of information system is classified as a provider (content) driven information system, distinctly different from a user driven information system.

The centerpiece of the proposed project is the web service-based tool set: user-oriented Filtering and Identification of Networked Data (uFIND). The purpose of uFIND is to provide rich and powerful facilities for the user to discover and choose a desired dataset. These facilities include navigation through the multi-dimensional metadata space using faceted search (uFIND Pilot, 2009), the ability to seamlessly access and browse datasets, and the use of uFIND's facilities as a web service for mashups with other AQ applications and portals.

Expected Significance

The outcome of the proposed work will significantly reduce the burden on the user in finding and accessing data relevant to the understanding and management of air quality and atmospheric composition. Air quality is the scientific and research domain of the proposing team and the proposed IT technologies and infrastructure will be directly applicable to the furthering of research for this domain. Air quality is also a significant application area of satellite observations. The tools, methods and infrastructure will also be applicable to other domains of Earth Sciences.

The AQ data user benefits from uFIND because it makes it easier to find suitable data and to access and use the data in creating either scientific or actionable knowledge. The providers of AQ data will also benefit, since their data products will be re-used more easily, enhancing the relevance and importance of their products. Finally, the earth science community at large will benefit from the proposed work through tools and methods for improved data discovery and access that are applicable to all earth science domains. The broad applicability of this proposed work will also be instrumental in the emerging Global Earth Observation System of Systems (GEOSS) and in particular in the development of the GEOSS Common Infrastructure (GCI).

uFIND Use Case

The utility of the AQ uFIND is illustrated below through a use case applicable to air quality management. On May 12, 2007 an intense wildland fire broke out at the Okefenokee National Park in southern Georgia. The intense smoke from the fire engulfed southern Georgia and drifted toward Florida. By May 24, the smoke pall was also drifting northward along the Mississippi Valley, impacting much of the eastern U.S. The smoke increased the ambient aerosol concentration well over the ambient air quality standard (35 µg/m3), causing exceedances. The air quality agencies in the smoke-impacted states were required to document that the exceedances were due to the Exceptional Event (EE) from the Okefenokee fires. In order to provide the required documentation, an air quality analyst of an impacted state would use uFIND (Fig. 1).

AQDataFinder.png
Figure 1. Pilot Air Quality uFIND, http://webapps.datafed.net/geoss_catalog.aspx

In the first step, the analyst would define the spatial and temporal domain of the area of smoke impact: Lat 15 to 45, Lon -90 to -65, 2007-05-27. Under the search facet 'Domain', clicking 'Fire' would reveal the fire observations for that geographic area and time range. The selection shows that the dataset NOAA_HMS_WFS from satellite platforms contains multiple fire-related parameters arising from a variety of sensors (AVHRR, GOES, MODIS). After a few minutes of browsing each parameter through the Data Viewer integrated in uFIND, the analyst chooses the parameter HMS_All to represent the fire locations. Next, the analyst chooses the 'Aerosol' domain to explore the various aerosol datasets and parameters, which yields hundreds of parameter records from numerous datasets. In order to confine the search, the analyst chooses 'Satellite' from the 'Platform' facet and 'Day' from the 'Time Resolution' facet, which still yields dozens of satellite-derived parameters. Since her interest is in the geographic location of smoke, she chooses the parameter AOD (Aerosol Optical Depth), which occurs in two datasets, MODIS4_AOT and Seaw_US. In order to assure that the satellite aerosol signal is due to light-absorbing smoke, she also examines the parameter 'ABS_AER', which is a smoke indicator in the OMI_AI_G dataset. Each of the selected datasets is accumulated as a layer in the Data Browser.
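The facet selections in this step can be sketched in a few lines of Python. This is only an illustration of the faceted-search idea, not the uFIND implementation: the dataset and parameter names come from the use case above, while the record layout, the facet keys (`domain`, `platform`, `time_resolution`), the facet values assigned to each record and the helper names `facet_search` and `facet_counts` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical metadata records. Dataset and parameter names are taken from the
# use case; the facet values attached to each record are illustrative guesses.
RECORDS = [
    {"dataset": "NOAA_HMS_WFS", "parameter": "HMS_All",
     "domain": "Fire", "platform": "Satellite", "time_resolution": "Day"},
    {"dataset": "MODIS4_AOT", "parameter": "AOD",
     "domain": "Aerosol", "platform": "Satellite", "time_resolution": "Day"},
    {"dataset": "Seaw_US", "parameter": "AOD",
     "domain": "Aerosol", "platform": "Satellite", "time_resolution": "Day"},
    {"dataset": "OMI_AI_G", "parameter": "ABS_AER",
     "domain": "Aerosol", "platform": "Satellite", "time_resolution": "Day"},
    {"dataset": "Airnow", "parameter": "PM2.5",
     "domain": "Aerosol", "platform": "Network", "time_resolution": "Hour"},
]


def facet_search(records, **selected):
    """Keep the records whose facets match every selected facet value."""
    return [r for r in records
            if all(r.get(facet) == value for facet, value in selected.items())]


def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(r[facet] for r in records)


# Narrow the 'Aerosol' domain to daily satellite parameters, as in the use case.
hits = facet_search(RECORDS, domain="Aerosol", platform="Satellite",
                    time_resolution="Day")
for record in hits:
    print(record["dataset"], record["parameter"])

# Counts like these are what a faceted interface shows next to each facet value.
print(facet_counts(facet_search(RECORDS, domain="Aerosol"), "platform"))
```

Each additional facet selection only narrows the current result set, which is why the analyst can refine the search step by step without composing a query in advance.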

As a next step, the analyst examines the magnitude of the smoke impact on the surface concentration of aerosols. Clicking 'Network' on the Platform facet and 'Hour' on the Time Resolution facet and 'Surface' on the Vertical facet returns all the hourly measured parameters on the surface, which can be found in three datasets: Airnow (EPA's real-time, surface monitoring network), AQS_H (EPA's additional hourly data) and Surf_Met (National Weather Service meteorological observations). For the smoke impact analysis, the analyst chooses the PM2.5 from Airnow and AQS_H and RHBext from SurfMet. These surface observations are also added to the stack of data layers in the Data Viewer. Having the location of the fires identified by the satellite-derived fire pixels, the smoke dispersion pattern from the satellite AOD and confirmation from the absorbing aerosol index she has sufficient documentation to satisfy the first requirement of the Exceptional Event Rule: Evidence that the smoke event occurred and that the smoke was observed in impacted areas.
Subsequently the examination of the surface observations reveals whether the satellite-observed smoke had a corresponding observed impact on the surface concentration. Through additional use of the available chemical composition data, she can also document that the excess aerosol mass concentration during the even can be attributed to smoke organic compounds.
The scenario described above is realistic; it was taken from an actual Exceptional Event case that was analyzed for EPA (Evidence for Flagging Exceptional Events, 2008). Those analyses originally required weeks of effort by experts to find, access and analyze the data. With the aid of the new uFIND System, the same activity can be performed in a matter of hours and the resulting report can be more thorough because of the broader data pool available. We anticipate that similar time savings and overall productivity improvements can be achieved in other areas of air quality science, management and policy. A particularly suitable application area is the international policy development regarding the Hemispheric Transport of Air Pollutants (HTAP), where satellite observations are crucial for documentation of intercontinental transport.

Technical Approach and Methodology

Information systems are practice driven. There are no natural laws to follow; the field is in a pre-scientific age. Major input to the information system must come from an analysis of the information use environment that (a) establishes the information flow into, within and out of an entity and (b) determines the criteria by which the value of information is judged (Taylor, 1986).


There are major impediments to seamless and effective data usage encountered by both data providers and users. The impediments from the user's point of view are succinctly stated in the report by Research, P.O.I.T.A.T.C.O (1998). In short: the user cannot find the data; if she can find them, she cannot access them; if she can access them, she does not know how good they are; if she finds the data good, she cannot merge them with other data. The data providers face a similar set of hurdles: the provider cannot find the users; if she can find users, she does not know how to seamlessly deliver the data; if she can deliver, she does not know how to make them more valuable to the users. This project intends to provide support to overcome the first two hurdles: finding and accessing the air quality data.

Architecture and Technologies Used

The centerpiece of the proposed project is the web service-based tool set: user-oriented Filtering and Identification of Networked Data (uFIND). The purpose of uFIND is to provide rich and powerful facilities for the user to discover and choose a desired dataset. These facilities include navigation through the multi-dimensional metadata space using faceted search (uFind Pilot, 2009), the ability to seamlessly access and browse datasets, and the use of uFIND's facilities as a web service for mashups with other AQ applications and portals (Fig. 2).

SOA Metadata.png

Fig 2. Service Oriented Architecture and ISO 19115 Metadata Schematic
The critical aspect of SOA is the loose coupling between service providers and service users. Loose coupling is accomplished through plug-and-play connectivity facilitated by standards-based data access service protocols. SOA is the only architecture that we are aware of that allows both loose, dynamic connection and seamless flow of data between a rich set of provider resources and a diverse array of users. In SOA, the service provider registers services in a suitable service registry; the users discover the desired service, retrieve an access key and seamlessly access the data. The result is a dynamic binding mechanism for the construction of loosely-coupled work-flow applications.
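
The register, discover and bind steps described above can be illustrated with a minimal in-memory sketch. Everything in it (the ServiceRegistry class, the service names, the keywords) is a hypothetical stand-in invented for illustration; it is not the registry used by uFIND or GEOSS, and a real registry would return a network endpoint rather than a Python callable.

```python
class ServiceRegistry:
    """Toy registry: providers publish, users find and bind (illustrative only)."""

    def __init__(self):
        self._services = {}

    def register(self, name, keywords, endpoint):
        # Provider side: publish a service under a name with discovery keywords.
        self._services[name] = {"keywords": set(keywords), "endpoint": endpoint}

    def find(self, keyword):
        # User side: discover services by keyword, without knowing the provider.
        return sorted(n for n, s in self._services.items() if keyword in s["keywords"])

    def bind(self, name):
        # User side: retrieve the access handle for a discovered service.
        return self._services[name]["endpoint"]


registry = ServiceRegistry()
# Hypothetical services; the callables stand in for remote data access.
registry.register("fire_pixels", ["fire", "satellite"], lambda: "fire data")
registry.register("aod", ["aerosol", "satellite"], lambda: "aod data")

found = registry.find("aerosol")
data = registry.bind(found[0])()
print(found, data)
```

The user code never refers to a provider directly, only to the registry, which is the loose coupling the paragraph describes.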

Service orientation has been accepted as the desired way of delivering Earth Observation data products. However, the adoption of formal standards-based data-as-service offerings has been slow within NASA and other Agencies. Offering images through the OGC WMS standard interface is becoming common for many Federal Agency data products, but there is currently no effective way for the users to find those services since they are dispersed over many web pages.

This project will rely on mature and widely used standard protocols for interoperability:
(1) OGC WMS, WCS and WFS web services for accessing data;
(2) netCDF-CF as the standard data format for gridded and point monitoring data and models;
(3) ISO 19115 for geospatial metadata;
(4) RSS/Atom and HTTP for inter-service message transfer (Fig. 4).

Furthermore, the uFIND software system will be implemented using three key maturing 'intellectual technologies': (1) Tagging for flexible, user-extensible structuring and annotation of diverse metadata; (2) Faceted search technology for navigation through multidimensional data discovery and (3) Ajax-based dynamic user interface to the data discovery and spatio-temporal data browsing and exploration.

The individual technologies described above can be considered at a technology readiness level (TRL) of at least 7. They have been developed and implemented in near operational environments. Many are used routinely in web data and information exchange among scientific and public domain communities. The work described in this proposal is targeted at combining those technologies and techniques in order to provide a user-driven uFIND.

Engineering and Implementation

The three major components of the proposed SOA-based uFIND software are: (1) facilities to identify AQ-relevant data services; (2) facilities for publication of data as services; and (3) facilities for AQ users to find and access data services. In this design, each component is a service, which allows the reuse of any part of the system in mashups with other web-based applications. Each of these major components will be supported by a set of tools and methods.

User-Oriented Filtering of Relevant Datasets

The AQ user needs will determine what datasets are initially targeted for inclusion in uFIND. The identified AQ Earth Observation priorities from the GEO Task US-09-01A (GEO Task US-09-01a Home Page, 2009) as well as other published, user-filtered dataset lists like the one prepared by Scheffe for EPA will guide the human and machine surveying of catalogs like GCMD, the GEOSS Clearinghouse, GeoSpatial OneStop, etc. in order to identify potential datasets. The identification of AQ-relevant datasets will initially be performed by the proposing team, but it will expand to be a communal process. Initially, the proposing team, with the ESIP AQ Workgroup, will work with AQ data providers and AQ users in order to populate uFIND with relevant AQ data. Once a critical mass of participation is reached within the uFIND network, the larger community of participants will be able to act as a filter for what is valuable to the AQ Community.

Another batch of selected datasets will come from several "data hubs" that already have a subset of datasets relevant to particular AQ analysts. Some examples are EPA AQS for State Analysts, VIEWS for RPOs, NILU EMAP, NASA Giovanni, as well as DataFed, developed by the proposing team. A full listing of current air quality data systems was compiled during EPA's Data Summit in 2008 (Summit, 2008), which also contains many user recommendations. In uFIND these datasets will be harmonized, so that the data being offered through any of these hubs can be reused by multiple applications.

It is hoped that uFIND will also attract potential AQ data providers from the AQ community to participate through the incentive of reaching more data users as well as the immediate reward that when the dataset is accessed through standard formats it immediately can be used with many viewers and other client applications. Participation in uFIND will also allow providers to review web analytic information and connect with air quality users of their data, receiving both direct and indirect feedback about their data products.

Data Access

The first condition of uFIND is providing data through a standard interface. Datasets found through AQ uFIND will be accessible through the standard OGC WCS and WMS data access protocols. This allows users to seamlessly access the data subset of interest. Once standard interfaces are added, datasets can be converted into other user formats like netCDF and KML, accessible by data browsers for exploration.

Data as Service

OGC WCS is particularly applicable for representing space-time-varying phenomena in Fluid Earth Sciences, atmosphere and oceans. OGC WCS version 1.1 is limited to grids, or "simple" coverages, with homogeneous range sets, but future revisions of the standard are anticipated to include support for a broader set of coverages, including point coverages.

WCS DataAccess.png

Figure 3. a. WCS Communication Protocol b.WCS Data Wrappers


Two attractive features of these services are that (1) they can be executed using the simple, universal HTTP GET/POST Internet protocol; and (2) the services are described by formal XML documents ("GetCapabilities", "DescribeCoverage"), and the access instructions and output formats can be advertised in those service documents (Fig. 3a).
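
As a rough illustration of how little is needed to call such a service over HTTP GET, the sketch below assembles key-value-pair request URLs. The endpoint and the coverage identifier are hypothetical, and the parameter names reflect a reading of the WCS 1.1 key-value encoding rather than any particular server; a given service's own capabilities document remains the authority on what it accepts.

```python
from urllib.parse import urlencode


def wcs_url(endpoint, request, **params):
    """Build a WCS key-value-pair request URL suitable for HTTP GET."""
    query = {"service": "WCS", "version": "1.1.0", "request": request}
    query.update(params)  # request-specific parameters go after the common ones
    return endpoint + "?" + urlencode(query)


# Hypothetical endpoint and coverage identifier, for illustration only.
ENDPOINT = "http://example.org/wcs"

capabilities_url = wcs_url(ENDPOINT, "GetCapabilities")
describe_url = wcs_url(ENDPOINT, "DescribeCoverage", identifiers="AOD")

print(capabilities_url)
print(describe_url)
```

Either URL can be pasted into a browser or fetched by any HTTP client, which is what makes the services usable from mashups and scripts alike.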

Wrappers and Tools

Most air quality datasets reside on servers in individual files or in SQL servers. These datasets need to go through a 'wrapping process' (Fig. 3b). Wrappers are reusable interfaces that turn data in files into data as service. Wrapper classes exist for: Sequential Images and Files, SQL server databases and for netCDF data files.

A particularly important wrapper class is for data files formatted according to the netCDF-CF convention. Special effort will be invested to create a portable software template for accessing netCDF-formatted data using the WCS protocol. It is hoped that this tool will promote the access and ease of use of distributed data.


The netCDF-CF file format is a common way of storing and transferring gridded meteorological and air quality model results. The CF convention for structuring and naming of netCDF-formatted data further enhances the semantics of the netCDF files. The netCDF-CF data format is most useful for the exchange of multidimensional gridded model data. It has also been demonstrated that the netCDF format is well suited for the encoding and transfer of station monitoring data. Traditionally, satellite data were encoded and transferred using the HDF format. The new netCDF version 4 (beta) library provides a common API for the netCDF and HDF-5 data formats. The netCDF-CF data format is supported by a robust set of well-documented and maintained low-level libraries for creating, maintaining and accessing data in that format on multiple platforms (Linux, Windows). The low-level libraries provided by UNIDATA also offer a clear application programming interface (API).

The WCS wrapper for netCDF software has three functions: (1) accessing the contents of netCDF-CF files over the HTTP GET Internet protocol; (2) imposing a standard data query language using the WCS standard; and (3) allowing easy (non-intrusive) adaptation to evolving standards. The main components of the wrapper software are shown schematically in the accompanying figure. At the lowest level are open source libraries for accessing netCDF and XML files. At the next level are Python scripts for extracting spatially subset slices for specific parameters and times. At the third level is the WCS interpreter that parses the WCS URL. The Capabilities and Description files are created automatically from the netCDF files, but the provider can supply a template containing information about the organization, contacts and other metadata. The WCS - netCDF-CF wrapper is a communal activity initiated by our group and pursued by collaborators from the US, New Zealand, Germany and elsewhere. It is also a participatory 'project' within the ESIP Air Quality Workgroup. It is anticipated that the wrapper will find wide application in this project.
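
The two upper levels of the wrapper (URL interpretation and spatial subsetting) can be sketched as follows. This is not the wrapper's actual code: the in-memory grid stands in for a netCDF variable, the URL and parameter names are hypothetical, and the bounding box is assumed to be given in west, south, east, north order.

```python
from urllib.parse import urlparse, parse_qs


def parse_getcoverage(url):
    """Extract coverage name, bounding box and time from a GetCoverage URL."""
    # Query keys are treated case-insensitively in this sketch.
    q = {k.lower(): v[0] for k, v in parse_qs(urlparse(url).query).items()}
    # Assumed axis order: west, south, east, north.
    west, south, east, north = (float(x) for x in q["boundingbox"].split(",")[:4])
    return {
        "coverage": q["identifier"],
        "bbox": (west, south, east, north),
        "time": q.get("timesequence"),
    }


def subset(grid, lats, lons, bbox):
    """Return the rows and columns of grid whose cell centers fall inside bbox."""
    west, south, east, north = bbox
    rows = [i for i, lat in enumerate(lats) if south <= lat <= north]
    cols = [j for j, lon in enumerate(lons) if west <= lon <= east]
    return [[grid[i][j] for j in cols] for i in rows]


# Stand-in for a gridded netCDF variable: 4 latitudes x 5 longitudes.
lats = [15.0, 30.0, 45.0, 60.0]
lons = [-100.0, -90.0, -80.0, -70.0, -60.0]
grid = [[r * 10 + c for c in range(5)] for r in range(4)]

# Hypothetical request using the spatial and temporal domain of the use case.
url = ("http://example.org/wcs?service=WCS&version=1.1.0&request=GetCoverage"
       "&identifier=AOD&BoundingBox=-90,15,-65,45&TimeSequence=2007-05-27")

req = parse_getcoverage(url)
result = subset(grid, lats, lons, req["bbox"])
print(req)
print(result)
```

Keeping the URL interpreter separate from the subsetting function mirrors the layered design described above: a change in the WCS standard would touch only the parser.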

AQ uFIND Metadata

The primary purpose of the metadata is to facilitate finding and accessing the data, in order to help with the first two hurdles that the users face. In current catalogs, metadata is provided by the provider or distributor of the data. This metadata includes intrinsic discovery metadata such as spatial and temporal extent, keywords and contact information for the provider. The metadata also includes distribution information for data access. Additionally, providers include various other information to help users once they are at their site. This approach is provider-driven and does not incorporate the user needs.

In a user-centric information system the user experience is improved by metadata contributed along the entire line of data usage from the providers to the users. Furthermore, the datasets need to have additional AQ-relevant metadata added to metadata that the provider gives in order for the AQ user to easily find the data. This additional metadata allows for sharp queries to be given in the parameter space, time, and physical space. Another feature of the user-centric system is that using web analytics additional metadata is attached to each dataset in order to provide information about dataset usage characteristics.

The uFIND system incorporates the structured metadata along the data usage chain using ISO 19115, the Metadata for Geospatial Data standard. ISO 19115 is ideal because its structure accounts not only for traditional data access and discovery metadata, but also for usage, lineage and other metadata needed for understanding the data. The uFIND record also includes pointers to additional metadata resources, like a pointer back to the original metadata record from the data provider and a pointer to the associated DataSpace where the metadata record can be viewed. ISO 19115 is also registered in the GEOSS Standards Registry, which allows uFIND to be harvested by the GEOSS Clearinghouse and found through other portals.

Metadata Flow Architecture

The major perceived contribution of the proposed work is an infrastructure for the harvesting, harmonization and continuous augmentation of metadata through the metadata flow system tailored to the use in air quality applications. A design of the metadata system is shown schematically in Figure 4.

MetadataFlowSchematic.png
Figure 4. Schematics of the uFIND metadata flow and components

The description below follows the components of the schematic from left to right. It should be noted that the schematic above represents an actual functioning prototype that has been implemented for the GEOSS Architecture Implementation Pilot-II (GEO, 2009). However, it is anticipated that both the individual components and the overall functionality of the metadata flow infrastructure will continue to evolve during this project.

Provider-Contributed Metadata

The components to the left indicate the providers that are anticipated to supply metadata. Each supplier is connected to uFIND through two processes. The first step is to filter available metadata resources and select the records that are relevant to air quality. The second step is to transform the diverse source metadata to the uniform AQ metadata records adopted by uFIND (ISO 19115). It should be recalled that a pre-condition for registering an air quality-relevant dataset in uFIND is that the data are accessible through an OGC standard data access protocol.


Metadata Tools

This proposed work will provide tools developed in collaboration with EU INSPIRE (INSPIRE, 2009) and others to create ISO 19115 metadata records either through the transformation of one metadata standard to ISO or by expanding the OGC GetCapabilities document. The tools simplify the process of generating metadata by making it semi-automatic. Metadata in various formats such as FGDC and DIF have crosswalks to ISO 19115. Using the published crosswalks, the metadata will be re-mapped into ISO 19115 through a style sheet transformation.

The proposing team has developed a metadata tool that maps corresponding GetCapabilities metadata elements to ISO 19115 metadata elements and provides a user interface to manually complete the remaining ISO 19115 metadata elements for which the GetCapabilities document does not contain information. The provider is also able to modify the metadata elements that were extracted from the GetCapabilities document and then save the record as an ISO 19115 compliant file. Each metadata record will incorporate additional AQ-specific metadata needed for the AQ uFIND as well as content acquired during the downstream usage. Prior to acceptance in uFIND, each ISO metadata record is also validated through a web service ISO 19115 validator.
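
The semi-automatic mapping step can be sketched roughly as below: elements present in the capabilities document are copied into the record, and the rest are listed for manual completion. The capabilities fragment, the element paths and the target field names are simplified assumptions made for illustration; they are not the team's tool, a real GetCapabilities schema, or the full ISO 19115 element paths.

```python
import xml.etree.ElementTree as ET

# A trimmed, made-up capabilities fragment (no namespaces), used only as input.
CAPABILITIES = """
<Capabilities>
  <Service>
    <Title>Example AOD service</Title>
    <Abstract>Daily aerosol optical depth.</Abstract>
  </Service>
</Capabilities>
"""

# Illustrative mapping: path in the capabilities document -> ISO-style field name.
MAPPING = {
    "Service/Title": "title",
    "Service/Abstract": "abstract",
    "Service/ContactInformation/ContactPerson": "pointOfContact",
}


def extract_iso_fields(xml_text, mapping):
    """Fill the fields found in the capabilities; list the rest as missing."""
    root = ET.fromstring(xml_text)
    record, missing = {}, []
    for path, iso_name in mapping.items():
        value = root.findtext(path)
        if value is None:
            missing.append(iso_name)  # left for the provider to complete by hand
        else:
            record[iso_name] = value.strip()
    return record, missing


record, missing = extract_iso_fields(CAPABILITIES, MAPPING)
print(record)
print(missing)
```

The list of missing fields is what a user interface would present to the provider for manual entry before the record is validated.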

Data Finding

Records are browsed using a customized, faceted search interface that was built to search the extended AQ metadata record set and find AQ data using specific filters such as sampling platform and data structure. Finding the right data is further enhanced by the user's ability to immediately view data as WMS through multiple clients provided by ESRI, Compusult and others. uFIND also allows query results to be embedded in another web page and also to link to multiple WMS viewers to browse layers in the catalog.

In uFIND, the AQ user can search for datasets through a faceted search which reacts to each step of the user's query. This dynamic interface ensures that the user can navigate by the more familiar facets first, and make decisions about less familiar facets as the search is narrowed. The results returned will have additional metadata to aid decision-making, such as the number of users that have used a dataset or links to places where it is used, so that the AQ user is provided with some additional context.
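
A minimal sketch of this kind of step-by-step narrowing is given below. The records are made up for illustration and the facet names follow those used in the use case; the actual uFIND search engine and record structure are not shown in this proposal, so this is only one plausible way to express the idea.

```python
from collections import Counter

# Made-up catalog records; facet names follow the ones used in the text.
RECORDS = [
    {"dataset": "A", "Domain": "Aerosol", "Platform": "Satellite", "Time Resolution": "Day"},
    {"dataset": "B", "Domain": "Aerosol", "Platform": "Network", "Time Resolution": "Hour"},
    {"dataset": "C", "Domain": "Fire", "Platform": "Satellite", "Time Resolution": "Day"},
    {"dataset": "D", "Domain": "Aerosol", "Platform": "Satellite", "Time Resolution": "Hour"},
]


def apply_facets(records, selections):
    """Keep the records that match every selected facet value."""
    return [r for r in records if all(r.get(f) == v for f, v in selections.items())]


def facet_counts(records, facet):
    """Count the remaining records per value of a facet, to guide the next click."""
    return Counter(r[facet] for r in records)


# Step 1: start from a familiar facet.
step1 = apply_facets(RECORDS, {"Domain": "Aerosol"})
# The interface reacts by showing what is left under the other facets.
counts = facet_counts(step1, "Platform")
# Step 2: narrow further using those counts.
step2 = apply_facets(step1, {"Platform": "Satellite", "Time Resolution": "Day"})

print([r["dataset"] for r in step1], dict(counts), [r["dataset"] for r in step2])
```

Recomputing the counts after each selection is what lets the interface show only facet values that still lead to results.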

Given that each AQ dataset is equipped with the standard data access interface, uFIND includes a Data Viewer. The user can browse data layers, compare them and further explore the data, ultimately choosing the most appropriate data for their application.

The AQ uFIND will be equipped with Google Analytics, which tracks multiple dimensions of data query usage as metrics. The analytics information will be exposed to the users as well and information derived from the metrics such as most popular queries will aid users in how to start their queries. The monitoring of query popularity is possible because each query has a unique URL that has a certain number of page views. The Google Analytics will also identify how most users access a dataset, i.e. through Google Earth, WMS, WCS, netCDF, etc. This feedback will help to hone the catalog to provide more useful information to users such as more datasets in a certain domain or tips for what others who viewed this dataset also viewed. The metrics will also help providers by identifying who some of the users of the data are and how they are finding a data product.
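
The idea that each query has a unique URL whose page views measure its popularity can be sketched as follows. The base URL and the way views are tallied are assumptions made for illustration; Google Analytics itself is not modeled here.

```python
from collections import Counter
from urllib.parse import urlencode

BASE = "http://example.org/ufind"  # hypothetical catalog address


def query_url(selections):
    """Same facet selections always give the same URL, whatever their order."""
    return BASE + "?" + urlencode(sorted(selections.items()))


# Stand-in for page-view tracking: one count per visited query URL.
page_views = Counter()
for selections in [
    {"Domain": "Aerosol", "Platform": "Satellite"},
    {"Platform": "Satellite", "Domain": "Aerosol"},  # same query, different order
    {"Domain": "Fire"},
]:
    page_views[query_url(selections)] += 1

most_popular, views = page_views.most_common(1)[0]
print(most_popular, views)
```

Sorting the selections before encoding is the design choice that matters: without it, the same query reached by a different click order would be counted as a different page.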

DataSpaces

Currently, metadata for air quality datasets is variable, distributed and normally created by the provider for the user. However, a single dataset can be used for many applications that the provider may or may not anticipate and the data may go through many value-adding processes before it reaches the "end user". Additional metadata can be created at any step along the usage chain and at this time there is no mechanism for collecting this metadata. Consequently, users don't know how a dataset has been used or what additional processing has occurred beyond the originator. One method to harvest and share metadata from all members of the usage chain is through community workspaces, DataSpaces. DataSpaces are virtual spaces for contributing and archiving metadata, discussing the dataset and harvesting distributed resources in order to capture the critical community knowledge about the dataset.


A DataSpace (Robinson, 2008) for a given dataset has two parts: structured, semantically rich metadata and flexible community-contributed metadata. The structured dataset description includes standard dataset metadata, data lineage, and data quality information such as provider, parameters, platform and time period. The additional value of the DataSpaces comes from the context provided by the dataset community: users, mediators and providers. This may be through links to other mediator or user-provided metadata, publications that reference the dataset, or web applications and tools using the dataset. DataSpaces also provide a place where a dataset community can connect through discussion and announcements about the dataset. As DataSpaces evolve and are used more by the community, additional functionality will emerge. Currently, there are still many issues with the implementation of DataSpaces, including how to link the DataSpace to the dataset as it moves along the usage chain and how material in DataSpaces can be reused in other metadata.


Connections to other Activities

uFIND is an open system and its functionality depends on inputs from AQ users, data providers as well as value-adding contributors. Most connections of uFIND system can be executed through formal service interfaces. This open architecture will be the key mechanism for the scalable growth and evolution of uFIND System.

For example, within ESIP we anticipate symbiotic interaction with the Semantic Web Cluster. uFIND, with its thousands of data layers, can provide a rich resource and be a use case for ontology development and testing. In return, the developed ontologies and catalog services will help uFIND improve the user experience. Similarly, it is anticipated that collaboration with various ESIP groups pertaining to data provenance will be pursued and the results incorporated into the uFIND metadata. This will include a provenance chaining service.

Collaborations with other activities will build upon previous and ongoing collaborations that have developed as part of the project team's earlier projects. Through recent community-building exercises, such as the Federation of Earth Science Information Partners (ESIP) Air Quality Workgroup and the GEOSS Architecture Implementation Pilot, the collaborative participants needed for a successful Air Quality uFIND have already been assembled, and through this project they will be focused on particular capabilities and needs.

Participation in the ESDSWGs is expected to include the Technology Infusion working group. Rudolf Husar and Stefan Falke will devote time, a total of 0.25 FTE, to these working groups. During previous NASA REASoN and AIST projects, Stefan Falke and Rudy Husar contributed to the tech infusion, sensor web and reuse working groups. Based on previous experience, we anticipate that this ACCESS project would provide the most suitable contributions to the Technology Infusion Working Groups but will work with NASA to scope our participation.

It is worth noting that a similar requirement in an earlier NASA REASoN CAN compelled us to become involved in the Earth Science Information Partners (ESIP) and to create and coordinate the ESIP Air Quality Workgroup, which has been an ongoing collaboration forum for over four years, with particularly active and broad participation over the last two years. We anticipate this ACCESS proposal to help continue the coordination and growth of the ESIP Air Quality Workgroup. The ESIP Air Quality Workgroup is a crucial partner in the proposed effort. In fact, we initially submitted this proposal's Notice of Intent with ESIP listed as the PI. However, it was determined that ESIP serving as PI introduced some conflicts with ESIP policies and by-laws, and ESIP is now represented in the project as a coordination and collaboration environment in which to develop, test, and refine uFIND.

The AQ uFIND will be registered as a component in the GEOSS Component and Service Registry (CSR). The GEOSS Clearinghouses query the GEOSS CSR for catalogs and then periodically harvest the catalogs for their metadata records. From the point of view of the GEOSS Common Infrastructure, this particular functionality makes uFIND an Air Quality community catalog.

Extensions of Past Work and Impact of this Work

The proposed work is an extension of past research efforts conducted since 2001 on the development of the federated data system, DataFed (DataFedwiki; Husar, 2007; Husar, 2006; Husar, 2005). The DataFed development was supported by NSF, EPA as well as through the 5-year NASA REASoN grant, 2004-2009, "Application of ESE DATA and Tools to Particulate Air Quality Management." Over 100 standards-based datasets for air pollution data mediated through DataFed have been accessed by over a thousand repeat users from throughout the world (Fig.5).
DataFed Analytics.png
Figure 5. Location of top 100 users from Jan 1, 2007-June 20, 2009. Captured with Google Analytics

DataFed was also the data access system for the development and subsequent application of EPA's Exceptional Event Rule. The DataFed services have provided the key data source for many interoperability experiments for data processing and other workflow applications conducted through ESIP and GEOSS. Most recently, DataFed was a key contributor as part of the GEOSS Architecture Implementation Pilot (AIP)-II discussed in the Workplan section of this proposal.

Relevance to NASA Programs

In meeting its goal to "Study Earth from space to advance scientific understanding and meet societal needs", the 2006 NASA Strategic Plan emphasizes that "as new types of Earth observations become available, information systems, modeling, and partnerships to enable full use of the data for scientific research and timely decision support will become increasingly important." The proposed Air Quality uFIND is a tool to improve the ability of NASA and its data users to use existing and new data products and to develop partnerships to increase the usefulness of NASA data products in air quality decision support.
The Decadal Survey published by the National Academies of Science outlines objectives for future satellite earth observation missions for NASA and NOAA. It also stresses the importance of adequate information systems in order to make use of those new observations by concluding that "fundamental improvements are needed in existing observation and information systems because they only loosely connect three key elements: (1) the raw observations that produce information; (2) the analyses, forecasts, and models that provide timely and coherent syntheses of otherwise disparate information; and (3) the decision processes that use those analyses and forecasts to produce actions with direct societal benefits." The proposed project helps couple these three elements more closely by making it easier for earth observations users who create analyses and forecasts to find and access earth observations, thereby improving their ability to support decision processes.

The Air Quality uFIND is a demonstration of a modern, service-oriented architecture and the application of a small set of interoperability technologies that allows the creation of distributed, but interoperable data systems. While the particular application is for the domain of air pollution and atmospheric composition, the architecture, the technologies as well as the tools and methods are directly applicable to other science domains of interest to NASA such as climate change.

The Air Quality uFIND constitutes a conduit through which datasets registered in GCMD and other data directories can be delivered more closely and easily to the air quality science and management community. Furthermore, the direct, two-way network link between uFIND and NASA data portals such as Giovanni constitutes a convincing demonstration of the viability of service-oriented data networking.

Relevance to NRA Objectives

The uFIND infrastructure and the associated tools and methods constitute a direct contribution to the ACCESS objective: to develop a "means for users to discover and use services being made available by NASA, other Federal agencies, academia, the private sector and others." Assessing the total life-cycle cost of this development is difficult since its development has been pursued for about a decade and the process of service-oriented data sharing will continue the development well beyond this two-year project. However, this is an important phase in the evolution of data networking because the applications have reached levels of TRL 7 or higher. This means that data from distributed providers can now be reliably and persistently found and accessed. In fact, we anticipate that at the end of the project, uFIND will incorporate several thousand air quality data layers that can be queried by sharp filters and immediately incorporated into browsers and processing applications.

The design of uFIND is initially targeted at the AQ-relevant datasets from NASA and the more than one hundred datasets already registered in DataFed as services. However, discussions are in progress for the implementation of a uFIND node as part of EPA's Exchange Network. Conceivably, additional uFIND nodes could be established at NOAA as well as at international agencies in Europe and Asia. These additional future activities are anticipated to be conducted in the architectural framework of GEOSS, and uFIND nodes would constitute the components of a distributed Air Quality Community Catalog. Hence the life-cycle cost of the system will be the combined investment in system development and maintenance, integrated over the time period of its evolution and over the multiple contributions by its community of participants.

uFIND is not appropriate for all datasets. The design of uFIND inherently targets datasets that are expected to have the quality and value such that multiple applications can benefit from their use. Creating extensive metadata and providing a standard data access interface may be inappropriate for datasets that are of limited applicability for re-use. uFIND is also inappropriate for complicated datasets such as aircraft sampling data. Furthermore, uFIND is not suitable for the delivery of raw sampling calibration datasets.

General Work Plan

The basic workplan for the proposed 2-year project consists of the following activities:
Year 1: Focus on establishing connections to uFIND system components and their interactions. Increased connections to data providers.
  • Survey of data directories and data hubs for relevant AQ data
  • Develop community process for selecting air quality datasets
  • Continue the testing and improvement of WCS-netCDF data wrapper
  • Facilitating WCS interface to selected datasets
  • Continue to develop the metadata mapping tool for easy creation of AQ metadata records
  • Facilitate registration of AQ datasets in uFIND
  • Implementation of DataSpaces for key datasets

Outcome of year 1: AQ-relevant datasets that can be found through sharp queries of the metadata and accessed through OGC Standard Data Access Services

Year 2: Focus on testing and refining uFIND
  • Users test the data access system; based on their feedback, flaws in the access services are eliminated
  • Users test uFIND for usability; flaws in metadata or connectivity are corrected
  • Connection and testing with uFIND and DataSpaces for compatibility and utility
Outcome of year 2: a well-tested and robust uFIND consisting of thousands of AQ-relevant data layers
It is anticipated that throughout this project collaboration would occur through the GEOSS ADC AIP-III, ESIP and EPA's Exchange Network.

Current State of Application

The initial demonstration of uFIND was prepared as part of the GEOSS Architecture Implementation Pilot (AIP-II) in 2008-2009. During that pilot, the components shown in Figure 4 were functional and connected to the GEOSS Clearinghouse (GEOSS Clearinghouse FGDC, 2009). In fact, uFIND served as the Air Quality Community Catalog that was routinely harvested by the clearinghouse. The metadata for the air quality data records has been developed, such that faceted search on about 500 data records could be executed. During AIP-II the data access services were confined to OGC WMS images. WCS data services from providers other than DataFed were not available.


As part of this project the uFIND pilot implementation will be expanded in a number of ways, such that at the end of the project uFIND will be a complete and robust package of services and tools with graphic user interfaces. The specific developments in year 1 will include the connection to data directories and user-driven selection of several thousand data layers offered by distributed providers. A major effort will be invested in helping data providers to expose their data through the WCS standard interface. The WCS-netCDF data wrapper will undergo considerable testing and further development of functionality and usability. The metadata preparation tools will be expanded to accommodate additional facets for searching and also links to additional metadata such as data provenance. The uFIND search engine will be expanded for a broader range of facets and also for alternative user interfaces. A substantial development will consist of linking uFIND data records to data browsers, such as Google Earth for spatial views and time charts for temporal data views. The interfaces of uFIND will also be streamlined to improve the incorporation of uFIND outputs into other client applications such as portals.

Management Approach

The major part of the proposed project will be performed by the core group of investigators listed in the table below. Professor Husar is the Principal Investigator and the architect of the federated data system, DataFed. He has a long-term interest in environmental informatics, with particular emphasis on tools for data access and exploration. Professor Falke is the Co-Investigator and an expert in geospatial informatics. He has also been a community leader for the ESIP Air Quality Workgroup and other community activities. Erin Robinson is a doctoral candidate in the School of Engineering and has years of experience in environmental data analysis and in tools and technologies to assist collaborative research. Participation in this project will constitute a significant contribution to her Ph.D. thesis. Kari Hoijarvi is the chief software designer and programmer of the major software systems developed at CAPITA since the 1990s. His deep understanding of Service Oriented Programming and data processing will be key to the development of uFIND. Ed Fialkowski is a software developer with experience in networking, portal development and other applications development.

{| class="wikitable"
! Name !! Organization !! Role/Contribution
|-
| Rudolf Husar || Washington Univ. || PI, uFIND Architect
|-
| Stefan Falke || Washington Univ. || Co-I, ESDWG, DataSpaces Connection
|-
| Erin Robinson || Washington Univ. || Graduate Student, uFIND Metadata
|-
| Kari Hoijarvi || Consultant || Finder Developer, WCS Server Developer
|-
| Ed Fialkowski || Washington Univ. || Metadata Tools Developer
|}

The proposed project will rely heavily on the community contributions of the ESIP Air Quality Workgroup, the GEOSS Architecture Implementation Pilot Community and other communities such as the group associated with EPA's Air Quality Data Summit.

As in past projects, this work will be performed as an open process with active participation of interested members from these communities. The community support will include design guidance, technology contributions and, most importantly, help in establishing service connections between uFIND and upstream and downstream services.

Data Sharing Plan

The entire basis of the Air Quality uFIND is data sharing, so the work plan outlined above is, in essence, our data sharing plan. However, we highlight here certain data sharing aspects of the proposed approach, including those that help other projects and programs enhance their data sharing capabilities. Data reuse will be enhanced through the service oriented architecture. The registered datasets are also directly accessible to air quality-specific, workflow-based clients, which can perform value-adding data processing and analysis.

Sharing research results with communities is vital to the continued development of uFIND and the infrastructures in which it is used. We will expose our results at ESTO technology conferences, AGU meetings and other forums conducive to new interactions with data providers and users. We will gain additional interaction with the earth science community through our continued leadership in the Earth Science Information Partners Federation (ESIP) and through GEO-related activities.

Operations Concept

In contrast to a stand-alone project that relies on its own individual initiatives to achieve persistence or sustainability, the Air Quality uFIND is expected to be closely coupled with other, larger efforts that provide opportunities for continued operation by various domain communities. For example, GEOSS is gradually being developed by assembling and developing components of the overall system and bringing them together. While a long-range operations plan is yet to be defined, the expectation is that successful development will lead to a sustainable infrastructure. The Air Quality uFIND is aimed at providing a critical component for the use of GEOSS by the air quality community; as a result, uFIND will become part of the GEOSS operations plan as that plan is developed. A suggested approach for the continued operation of uFIND is through collaborations and resource sharing. Operations will be co-supported by other organizations that would adapt it; in the U.S., one such candidate is EPA's Exchange Network.


References

Air Quality and Health Working Group (GeossPilot2). Available at: http://sites.google.com/site/geosspilot2/air-quality-and-health-working-group [Accessed June 26, 2009].

Air Quality Work Group - Federation of Earth Science Information Partners. Available at: http://wiki.esipfed.org/index.php/Air_Quality_Work_Group [Accessed June 26, 2009].

Cleveland, Harland, Information as a Resource, Futurist, 16, 34-39, 1982.

Data Summit Workspace - Federation of Earth Science Information Partners. Available at: http://wiki.esipfed.org/index.php/Data_Summit_Workspace [Accessed June 26, 2009].

Evidence for Flagging Exceptional Events - Federation of Earth Science Information Partners. Available at: http://wiki.esipfed.org/index.php/Evidence_for_Flagging_Exceptional_Events [Accessed June 26, 2009].

GEO Task US-09-01a Home Page. Available at: http://sbageotask.larc.nasa.gov/ [Accessed June 26, 2009].

GEO User Requirements for Air Quality - Federation of Earth Science Information Partners. Available at: http://wiki.esipfed.org/index.php/GEO_User_Requirements_for_Air_Quality [Accessed June 26, 2009].

GEOSS-CLEARINGHOUSE: Common Search Facility. Available at: http://clearinghouse.awcubed.com/perl-bin/ch_query [Accessed June 26, 2009].

Husar, R., Falke, S. & Hoijarvi, K., 2006. Interoperability of Web Service-Based Data Access and Processing: Experience Using the DataFed System. ESTO Meeting, Paper A6P2.

Husar, R. & Poirot, R., 2005. DataFed and FASTNET: Tools for agile air quality analysis. EM (Air and Waste Management Association), 39.

Husar, R.B. & Hoijarvi, K., 2007. DataFed: Mediated web services for distributed air quality data access and processing. In IEEE International Geoscience and Remote Sensing Symposium, 2007. IGARSS 2007. pp. 4016-4020.

Husar, R.B. et al., 2008. DataFed: An Architecture for Federating Atmospheric Data for GEOSS. IEEE Systems Journal, 2(3), 366-373.

INSPIRE geoportal. Available at: http://www.inspire-geoportal.eu/ [Accessed June 26, 2009].

Panel on Information Technology and the Conduct of Research et al., 1989. Information Technology and the Conduct of Research: The User's View, National Academy Press.

Robinson, E.M. & Husar, R.B., 2008. DataSpaces: Using Community Workspaces to Enable Rich Air Quality Metadata. In American Geophysical Union, Fall Meeting 2008, abstract# IN22A-06.

Taylor, R.S. & Voigt, M.J., 1986. Value added processes in information systems, Greenwood Publishing Group Inc. Westport, CT, USA.

uFIND Pilot. Available at: http://webapps.datafed.net/geoss_catalog.aspx [Accessed June 26, 2009].