Discovery White Paper


NOTE: This is the ESIP Discovery Cluster's forum for working on a white paper for NSF's EarthCube program. If you feel that you have something positive to contribute and aren't yet a member of the cluster, feel free to join (i.e., the wiki and the monthly telecons). If you'd like to work on sections of the White Paper, add your name to that section of the outline and start writing (note: multiple authors for a section are encouraged but should collaborate with each other)! We are looking for roughly 2-3 pages of text (plus images, if any), so be pithy.


Title: A Grassroots-Federated Approach to Earth Science Data Discovery

Authors: The Earth Science Information Partners Discovery Cluster (ESIP-DC)

Points of Contact: Ruth Duerr, rduerr@nsidc.org, Chris Lynnes, christopher.s.lynnes@nasa.gov, ...

Introduction - Ruth & Chris L.

The Grand Challenge

Earth science data is growing by leaps and bounds, not only in sheer volume, but also in variety. This bounty presents new opportunities to Earth science practitioners, providing more views into how Earth processes work and, in particular, how the various Earth systems interact. However, integrating these data sets poses significant challenges as scientists try to incorporate the growing diversity into their scientific processes:

  • discovering data available from the ever-increasing set of data providers including individual researchers and research teams with limited technical support
  • employing tools and services with the data
  • harmonizing different datasets so they can be integrated

The NSF EarthCube program takes square aim at the issues that make data integration so difficult, with its goal of a fully interoperable digital access infrastructure. Such an infrastructure would encompass discovery of data and services across disciplines and from distributed providers, interoperable services and data formats, and tools and services that can work with a wide range of data sets. This goal faces the technical challenge of fielding a suite of standards and technologies that not only foster interoperability but also interoperate with each other. This technical hurdle is paired with the socio-political challenge of achieving widespread adoption throughout a diverse Earth science community that spans disciplines, institutions, and even funding agencies and organizations. The Earth Science Information Partners (ESIP) Federation faces similar challenges, and the solutions it has arrived at can provide a useful template for moving EarthCube forward. The ESIP Discovery Cluster in particular has developed a grassroots, federated approach to the difficult issue of discovery of both services and data.

The ESIP Federation

The ESIP Federation has been working to make Earth science data interworkable since its inception in 1998. The Federation is composed of a wide variety of Earth science data, information, and service providers: academic institutions, commercial interests, government agencies at the federal, state, and local levels, and non-governmental organizations. Members also cover a wide range of missions, from education to research to applications, as well as a wide range of disciplines: solid Earth science, oceanography, atmospheric sciences, land surface, ecology, and demographics.

This diversity has forced ESIP to confront many of the challenges to data integration. At the same time, it virtually mandates a loosely knit organization. While ESIP has a well-defined governance structure with respect to business activities, technical progress most often comes out of ESIP "clusters". Clusters are self-organizing groups of people within the Federation who come together to tackle a particular issue, usually with integration across the Federation as the main goal. Some clusters are domain-focused, such as the Air Quality cluster, while others are formed to address particular aspects of interoperability, such as the Discovery and Earth Science Collaboratory clusters.

The Discovery Cluster

The Discovery Cluster began as the Federated Search Cluster in 2009 to address the problem of discovering Earth science data across the widest possible variety of data providers. In keeping with the federated character of the ESIP Federation at large, a federated search solution was developed based on the OpenSearch (http://www.opensearch.org) conventions. In January 2011, the Federated Search Cluster was broadened to include subscription-based ("*casting") methods of discovery, at which point it was renamed the Discovery Cluster. The Discovery Cluster works to develop usable solutions to the problem of distributed and diverse providers, leveraging existing standards, conventions, and technologies, with a predilection for simple solutions that have a high likelihood of voluntary adoption.

Technology framework - Chris Mattmann & Chris Lynnes & Ruth

Federated Search Framework

The distributed, diverse nature of the ESIP Federation militates in favor of a federated solution to the basic problem of search. Federated search allows clients to search the contents of multiple distributed catalogs simultaneously, merging the results for presentation to the user. Federated search has a long history within the Earth science disciplines: the EOSDIS (Earth Observing System Data and Information System) Version 0 system implemented a federated search across eight remote sensing data centers as early as 1994. In the wider community, Amazon implemented a federated search, A9, to search across vendors several years ago; the conventions used for A9 eventually became known as OpenSearch (http://www.opensearch.org). OpenSearch has the virtue of being extremely simple to understand and implement, and simplicity is a critical aspect of any framework that needs widespread, voluntary adoption in a community with significant variation in technical capacity.
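
To make the mechanics concrete, here is a minimal Python sketch of a federated OpenSearch client that fans a single query out to several endpoints and merges the Atom results. The endpoint URLs are hypothetical placeholders; a real client would obtain each provider's URL template from its OpenSearch description document.

  from concurrent.futures import ThreadPoolExecutor
  from urllib.request import urlopen
  from urllib.parse import urlencode
  import xml.etree.ElementTree as ET

  ATOM = "{http://www.w3.org/2005/Atom}"

  # Hypothetical OpenSearch endpoints; real URL templates come from each
  # provider's OpenSearch description document.
  ENDPOINTS = [
      "http://datacenter-a.example.org/opensearch?",
      "http://datacenter-b.example.org/opensearch?",
  ]

  def search_one(endpoint, terms, bbox, start, end):
      # Standard OpenSearch parameter plus the Geo and Time extensions.
      params = urlencode({"searchTerms": terms, "geo:box": bbox,
                          "time:start": start, "time:end": end})
      with urlopen(endpoint + params, timeout=30) as response:
          return ET.parse(response).getroot().findall(ATOM + "entry")

  def federated_search(terms, bbox, start, end):
      # Fan the query out to every provider in parallel...
      with ThreadPoolExecutor(max_workers=len(ENDPOINTS)) as pool:
          futures = [pool.submit(search_one, ep, terms, bbox, start, end)
                     for ep in ENDPOINTS]
          entries = [e for f in futures for e in f.result()]
      # ...then merge the results for presentation to the user.
      entries.sort(key=lambda e: e.findtext(ATOM + "updated", ""), reverse=True)
      return entries

  for entry in federated_search("sea ice", "-180,60,180,90",
                                "2010-01-01", "2011-01-01"):
      print(entry.findtext(ATOM + "title"))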

Datacasting Frameworks

Federated search is a pull model for making the location of data of interest known: clients must know where potential data are likely to be found, and search of those sites must be explicitly enabled. Another model for advertising information, in common use outside of science, is publish and subscribe. In this model, providers simply advertise the existence of their data on their websites, using standard protocols such as RSS and Atom. The resulting streams of data advertisements (i.e., casts or feeds) are instantly discoverable, both by people who have subscribed and by web crawlers such as those used by Google, Yahoo, etc. Moreover, if the feeds are formatted using the same conventions as OpenSearch, then systems typically called aggregators can find them and include them in federated search results. In this manner, investigators without significant technical support can make their data discoverable through a multitude of client applications with no effort beyond generating a simple file and making it available on their web site.
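
As a sketch of how low the barrier to publishing can be, the following Python fragment generates such a feed using only the standard library; the collection name and all URLs are invented for illustration.

  import xml.etree.ElementTree as ET

  ATOM_NS = "http://www.w3.org/2005/Atom"
  ET.register_namespace("", ATOM_NS)

  def el(parent, tag, text=None, **attrs):
      # Small helper for building namespaced Atom elements.
      e = ET.SubElement(parent, "{%s}%s" % (ATOM_NS, tag), attrs)
      e.text = text
      return e

  feed = ET.Element("{%s}feed" % ATOM_NS)
  el(feed, "title", "Example Glacier Photograph Collection")
  el(feed, "id", "http://example.org/datacasts/glaciers.atom")
  el(feed, "updated", "2011-09-15T00:00:00Z")

  entry = el(feed, "entry")
  el(entry, "title", "Glacier photographs, 1900-1950")
  el(entry, "id", "http://example.org/data/glaciers-1900-1950")
  el(entry, "updated", "2011-09-15T00:00:00Z")
  el(entry, "summary", "Scanned historical photographs of named glaciers.")
  # The enclosure link points subscribers (and crawlers) at the data itself.
  el(entry, "link", rel="enclosure", href="http://example.org/data/glaciers.zip")

  # "Publishing" is then just placing this file on the provider's web site.
  ET.ElementTree(feed).write("glaciers.atom", xml_declaration=True,
                             encoding="utf-8")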

In addition to advertising data collections as a whole, similar mechanisms can be used to publish the existence of new data within a collection. In this manner, subscribers are alerted to the existence of new data meeting their criteria, for example new granules within a time series. It is even possible to obtain the data automatically upon receipt of the advertisement.
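
The subscriber side can be equally lightweight. Below is a minimal sketch, assuming a hypothetical granule-level feed, of polling for new entries and retrieving the advertised data automatically.

  from urllib.request import urlopen, urlretrieve
  import xml.etree.ElementTree as ET

  ATOM = "{http://www.w3.org/2005/Atom}"
  FEED_URL = "http://example.org/datacasts/sst-granules.atom"  # hypothetical

  def fetch_new_granules(last_seen):
      # Re-read the feed and act on any entry newer than the last check.
      with urlopen(FEED_URL, timeout=30) as response:
          feed = ET.parse(response).getroot()
      newest = last_seen
      for entry in feed.findall(ATOM + "entry"):
          updated = entry.findtext(ATOM + "updated", "")
          if updated <= last_seen:
              continue  # granule already seen on a previous poll
          for link in entry.findall(ATOM + "link"):
              if link.get("rel") == "enclosure":
                  # Obtain the data automatically upon receipt of the advertisement.
                  urlretrieve(link.get("href"),
                              link.get("href").rsplit("/", 1)[-1])
          newest = max(newest, updated)
      return newest  # pass back in on the next poll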

Service Casting Framework

Discovery and use of Web services for querying, accessing, and processing science datasets is hampered by the lack of open and web-scalable service and data registries that provide rich and complete search for all available services, datasets, interesting geophysical events, and data granules relevant to studying those events. Centralized registries, such as NASA's Global Change Master Directory (GCMD), the Earth Observing System Clearinghouse (ECHO), and the Group on Earth Observations System of Systems (GEOSS) Registry, are useful for basic metadata search, but due to their inherent paradigm they suffer from various failings: they often have cumbersome interfaces for registering services or datasets, thereby presenting a barrier to adoption; each uses its own metadata standard, which may not be easily machine-readable or interoperable; they do not evolve their metadata fields (registered information) rapidly to suit changing user needs; their search results are often a long catalog without relevance ranking to help the user; and they compete with each other for adoption, thereby fragmenting the user base. To find all available web services, the user must search all three registries (i.e., perform a meta-search), and even then the resulting set will omit many unregistered services and datasets.

The purpose of a services and dataset registry is to provide “discovery” (search) of the services and datasets. However, as Google search has amply demonstrated for information on the Web, one need not explicitly “push” metadata into a centralized registry in order to enable discovery. Instead, a service provider can simply “advertise” their service or dataset by publishing structured metadata of their choice on the web as a syndicated (broadcast) feed. Search providers can then “discover” the data or service and its metadata by crawling, “pull” the metadata into an indexed aggregation, and compete to provide rich search services (e.g. text keyword, hierarchical taxonomy, or semantic faceted search). Central registries don’t scale well in size or flexibility because they require users to “push” appropriate (“asked for”) information into multiple registries. An “open publish and crawl” (pull) system is inherently more scalable, flexible, and extensible.
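
The aggregation side of this pull model can likewise be sketched briefly. The Python fragment below harvests a seed list of casts and builds a simple keyword index over their entries; the seed URLs are invented, standing in for feeds a real crawler would discover on the web.

  from collections import defaultdict
  from urllib.request import urlopen
  import re
  import xml.etree.ElementTree as ET

  ATOM = "{http://www.w3.org/2005/Atom}"
  SEED_FEEDS = [
      "http://provider-a.example.org/datacast.atom",     # hypothetical
      "http://provider-b.example.org/servicecast.atom",  # hypothetical
  ]

  index = defaultdict(set)  # keyword -> set of entry ids

  for url in SEED_FEEDS:
      with urlopen(url, timeout=30) as response:
          feed = ET.parse(response).getroot()
      for entry in feed.findall(ATOM + "entry"):
          entry_id = entry.findtext(ATOM + "id", "")
          text = " ".join(filter(None, [entry.findtext(ATOM + "title"),
                                        entry.findtext(ATOM + "summary")]))
          # Index every word so search providers can offer rich text search.
          for word in re.findall(r"[a-z]+", text.lower()):
              index[word].add(entry_id)

  # The aggregator can now answer keyword queries over everything it crawled.
  print(sorted(index.get("aerosol", set())))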

Service casting, similar to datacasting, provides a framework that allows service producers to publish and advertise their services using a community-derived definition, implemented in the standard RSS and Atom protocols. The public availability of these published service "feeds" allows them to be discovered and/or subscribed to without any specialized applications or tools. The service providers remain in control of their published service descriptions and can easily keep them up to date and complete. This standards-based approach to providing service definitions supports the development of specialized tools for feed authoring and aggregation that will be consistent across all service casts made available through the community-embraced convention. A prototype project that provides information and tools for service casting is available at http://ws3dev.itsc.uah.edu/infocasting/ .
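
For illustration, the fragment below constructs one plausible service cast entry in Python. The element choices follow the spirit of the convention described above rather than the exact community-derived definition, and all URLs are invented.

  import xml.etree.ElementTree as ET

  ATOM_NS = "http://www.w3.org/2005/Atom"
  ET.register_namespace("", ATOM_NS)

  entry = ET.Element("{%s}entry" % ATOM_NS)
  ET.SubElement(entry, "{%s}title" % ATOM_NS).text = (
      "Granule-level OpenSearch for an example SST data set")
  ET.SubElement(entry, "{%s}id" % ATOM_NS).text = (
      "http://example.org/services/sst-opensearch")  # hypothetical
  ET.SubElement(entry, "{%s}updated" % ATOM_NS).text = "2011-09-15T00:00:00Z"
  # Categorize the service so aggregators and clients can filter by type.
  ET.SubElement(entry, "{%s}category" % ATOM_NS,
                {"term": "OpenSearch", "label": "Service interface"})
  # Point consumers at the machine-readable service description.
  ET.SubElement(entry, "{%s}link" % ATOM_NS,
                {"rel": "describedby",
                 "href": "http://example.org/services/sst-osdd.xml"})

  print(ET.tostring(entry, encoding="unicode"))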

Beyond Discovery - Integrating Frameworks Together

One problem that remains largely unsolved is that of linking data with the services that operate on them. An exemplar of this issue is the set of services offered through the Open-source Project for a Network Data Access Protocol (OPeNDAP). OPeNDAP is used widely among ESIP Federation partners to provide remote data access. It has several attractive features, such as the ability to subset data remotely and several possible "response types" that provide data and metadata in a variety of formats....
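
As a concrete illustration of the data/service linkage at issue, the sketch below exercises several standard DAP response types against a hypothetical OPeNDAP granule URL of the kind a datacast or service cast might supply.

  from urllib.request import urlopen

  GRANULE = "http://example.org/opendap/sst/sst.200101.nc"  # hypothetical

  def dap_response(suffix, constraint=""):
      # Each OPeNDAP "response type" is selected by a URL suffix; an optional
      # constraint expression subsets the data on the server side.
      url = GRANULE + suffix + ("?" + constraint if constraint else "")
      with urlopen(url, timeout=30) as response:
          return response.read().decode("utf-8", errors="replace")

  print(dap_response(".dds"))    # dataset structure: variables and shapes
  print(dap_response(".das"))    # descriptive attributes (metadata)
  print(dap_response(".ascii", "sst[0][0:9][0:9]"))  # remote subset, as ASCII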

Example Scenario - Ruth

Following is a generalized scenario, which Discovery Cluster members are working to implement, that illustrates how all of these technologies unite to enable data and service discovery, publishing, and use.

  • An investigator starting a new project uses their favorite portal to obtain a list of data and services meeting their spatial, temporal, and free-text criteria. Since the portal conducts a federated search across the wide variety of aggregators and data centers holding data typically of interest to the investigator, the results contain every advertised data set meeting the criteria, whether the data are held by an individual investigator or by any one of the multitude of data providers that exist worldwide.
  • The investigator examines several data sets and subscribes to be notified whenever new services or service updates for those data sets become available.
  • The investigator also subscribes to be notified when a new data set meeting their query criteria is announced or when a data set already on their subscription list is updated.
  • The investigator finds results from two locations that are of interest. They retrieve the data from the first location using the link to the data set provided.
  • While perusing the service descriptions available for one of the data sets, the investigator realizes that a granule (e.g., file) level OpenSearch service produces KML output compatible with their favorite GIS analysis tool, so they use that tool to browse the latest data, decide that the data is adequate, and download it all from within their analysis environment using a well-known protocol such as OPeNDAP or Web Coverage Service.
  • One morning, while perusing their feed reader, the investigator becomes aware that a new data set meeting their criteria has been published. They examine that data set and add it to their data set subscriptions.
  • The investigator examines the services available for the new data set and is disappointed to find that no granule OpenSearch or equivalent service is available but that one is planned. They subscribe to receive service updates for the data set.
  • Eventually the service they were interested in is released, and they are automatically notified of that event (when the service cast is updated).
  • As a part of their work, the investigator generates a new data set that they decide should be published. They choose to use an open-source, web-based datacast creator to announce the availability of the data via FTP from their web site.
  • A number of special-purpose aggregation crawlers (for example, one aggregating all geologic data held anywhere) discover the new datacast while crawling the web and add it to their aggregations. Other investigators who subscribe to those aggregators are notified of the new data and may contact the investigator to see if they can use it for their research.

Governance - Hook

Looking to the future - Chris Lynnes

The preceding text demonstrates how a lightweight standard or convention can nonetheless enable significant interoperability with respect to discovering data and services, and furthermore, how similar, interlocking conventions can provide cross-cutting interoperability, in this case between services and data. However, these are not the only Earth science entities that we should like to encompass in our drive to make systems "interworkable". Data and services (or tools) can be combined in sequences to form scientific workflows. The analysis results from executing these workflows may also be thought of in a fashion similar to data. And the results themselves may be aggregated into an experiment, in much the same way that different model runs are aggregated into an ensemble. Many of the key discovery attributes of workflows, results, and experiments can be inherited from the data and service building blocks from which they are made. As a result, it is not too ambitious to hope that the entire "information stack", from data and services up through workflows, results, and experiments, can be interoperable (or interworkable) both horizontally (data with data, result with result) and vertically (data with tool with workflow with result with experiment). Such an interoperability framework would convey the key advantage of presenting everything in its proper context: a given result could be traced back down through the analysis workflow to the tools/services and data that went into it. This rich context would be further enhanced by supporting some basic social networking technology, allowing researchers to annotate any level of the information stack (from data/service up through experiment) with contextual knowledge.

Such an "Earth Science Collaboratory (ESC)" (Fig. x) has been proposed within the ESIP Federation, with an Earth Science Collaboratory Cluster formed to push the idea forward. The ESC would allow researchers to share not just data, but tools, services, analysis workflows (i.e., techniques), and resutls as easily as links are shared today in tools such as Facebook, thus preserving the full context of a given result as well as the contextual knowledge added by the researcher. However, there are potential benefits for many other types of user. For instance, science assessment committees would be able to share with each other both the (usually highly processed) end results and articles but also the input data and tools, greatly increasing transparency of the assessment. Novice graduate students would be able to "follow" more experienced researchers in the field, thus learning how to handle the data properly and avoiding common pitfalls. Educators would be able to put together science stories that trace back to the original data, allowing them to give students exposure to what "real" data look like, and how they are eventually processed to yield a compelling story. Users of Decision Support Systems (DSS) would be able to collaborate in real time with the scientist whose research is incorporated into the DSS, providing a valuable bridge over the chasm that often separates research and operations.

Such an Earth Science Collaboratory faces a number of hurdles, both technical and non-technical. However, the NSF EarthCube is aligned along the same axis, and could therefore provide the critical impetus toward realization of the ESC.

Conclusion (All)

Discovery and Earth Science Collaboratory Cluster Participants

  • Clynnes 10:35, 15 September 2011 (MDT)