Discovery White Paper

NOTE: This is the ESIP Discovery Cluster's forum for working on a white paper for NSF's EarthCube program. If you feel that you have something positive to contribute and aren't yet a member of the cluster, feel free to join (i.e., the wiki and the monthly telecons). If you'd like to work on sections of the White Paper, add your name to that section of the outline and start writing (note - multiple authors for a section are encouraged but should collaborate with each other)! We are looking for roughly 2-3 pages of text (plus images if any), so be pithy.

Edited and Condensed Version

Title: A Federated Approach to Earth Science Data Discovery

Authors: The Earth Science Information Partners Discovery Cluster (ESIP-DC)

Points of Contact: Ruth Duerr, rduerr@nsidc.org, Chris Lynnes, christopher.s.lynnes@nasa.gov, ...

Introduction - Ruth & Chris L.

The Grand Challenge

Earth science data is growing fast and accelerating, not only in pure data volume, but also in variety. This bounty presents new opportunities to Earth science practitioners, providing new views into how Earth processes work and particularly how the various Earth systems interact. To truly understand the Earth system, however, requires integration of this growing diversity of data. How can scientists discover, access, and associate data from myriad disciplines, each with their own cultures of data collection, structure, and description? This integrated view poses significant challenges, including:

discovering data available from the ever-increasing set of data providers including individual researchers and research teams with limited technical support (simply knowing where to look is a challenge)
determining fitness for use and assessing data uncertainty outside one's own domain knowledge
employing specific tools and services with the data
harmonizing the structure or semantics of different data sets so they can be integrated

The NSF EarthCube program takes square aim at the issues that make data integration so difficult, with its goal of a fully interoperable digital access infrastructure. Such an infrastructure would encompass discovery of data and services across disciplines and from distributed providers, interoperable services and data formats, and tools and services that can work with a wide range of data sets. This goal faces the technical challenge of fielding a suite of standards and technologies that not only foster interoperability but also interoperate with each other. This technical hurdle is paired with the socio-political challenge of achieving widespread adoption throughout a diverse Earth science community that spans disciplines, institutions, and even funding agencies and organizations.

The Federation of Earth Science Information Partners (ESIP) has been addressing some of these issues for years, and we believe we have initial solutions that can provide a useful template for moving EarthCube forward. A subgroup within ESIP, the ESIP Discovery Cluster, has been working on the particular issue of discovery. We find that traditional methods reliant on centralized registries do not scale across disciplines and do not allow adequate description of data to diverse audiences. We have, instead, developed a basic, grassroots, federated approach to discovery that does not rely on centralized registries. Further, recognizing that data are often useless without associated services, we also enable discovery of data services and associate them with relevant data.

The ESIP Federation

The ESIP Federation has been addressing interworkable data since its inception in 1998. The federation is composed of a wide variety of Earth system science data, information and service providers: academic institutions, commercial interests, government agencies at federal, state and local levels, and non-governmental organizations. Members also cover a wide range of missions, from educational to research to applications, as well as a wide range of disciplines: solid-earth, oceanography, atmospheric sciences, land surface, ecology, and demographics.

This diversity has forced ESIP to confront many of the challenges to data integration. At the same time, it virtually mandates a loosely knit organization. While ESIP has a well-defined governance structure with respect to business activities, technical progress most often comes out of ESIP "clusters". Clusters are self-organizing groups of people within the federation who come together to tackle a particular issue, with integration across the federation usually the main goal. Some clusters are domain-focused, such as the Air Quality cluster, while others are formed to address particular aspects of information management, such as the Discovery, Preservation & Stewardship, and Semantic Web clusters. More recently, a new Earth System Science Collaboratory cluster has formed to help relate the diverse activities of ESIP and enhance scientific collaboration around data, very much in the spirit of EarthCube.

The Discovery Cluster

The Discovery Cluster began as the Federated Search Cluster in 2009 to address the problem of discovering Earth science data over the widest possible variety of data providers. In keeping with the federated aspect of the ESIP Federation at large, a federated search solution was developed based on the OpenSearch (http://www.opensearch.org) conventions. In January 2011, the Federated Search Cluster was broadened to include subscription-based ("*casting") methods of discovery, at which point it was renamed the Discovery Cluster. The Discovery Cluster works to develop usable solutions to the problem of distributed and diverse providers, leveraging existing standards, conventions, and technologies, with a predilection for simple solutions that have a high likelihood of voluntary adoption.

Technology framework - Chrises Mattman & Lynnes & Ruth

Open Source Understanding Framework

Part of the group's expertise lies in understanding the use and production of open source software, both for Earth science discovery and federation systems and for data management systems in a broader context (e.g., bioinformatics, astronomy, and planetary science). The expanding availability of open source software and associated licenses, community models (a "help desk" model, in which a centralized company controls the source code and the acceptance of contributions, compared with a "community" model, in which meritocracy leads to source code access and governance), open source providers (Sourceforge, Google Code, the Apache Software Foundation, etc.), IP issues, interaction models, and the like has drawn into focus the need for an open source understanding framework. This is a specific focus of a subset of members within the group. Our team has made two recent presentations describing this framework to the broader open source software community. The first, at NASA's 1st Open Source Summit (OSS), suggested a model for NASA to follow for consuming and producing open source software, and identified the strategic dimensions and tradeoffs involved. The second, an interactive workshop at the 2011 Summer ESIP Federation Meeting, presented a shorter version of the NASA OSS talk, along with a panel of experts from NASA and from the NSF's NEON project who discussed the 13 recommendations from the NASA OSS. We anticipate a need for our Open Source Understanding Framework within EarthCube and its array of projects that require off-the-shelf software components, both for selecting software from the open source community and for contributing software to it.

Federated Search Framework

The distributed, diverse nature of the ESIP Federation militates in favor of a federated solution to the basic problem of search. Federated search allows clients to search the contents of multiple distributed catalogs simultaneously, merging the results for presentation to the user. Federated search has a long history within the Earth science disciplines. The EOSDIS (Earth Observing System Data and Information System) Version 0 system implemented a federated search across eight remote sensing data centers as early as 1994. In the wider community, Amazon implemented a federated search named A9 to search across vendors several years ago. The conventions used for A9 eventually became known as OpenSearch (http://www.opensearch.org). OpenSearch has the virtue of being extremely simple to understand and implement. Simplicity is a critical aspect of any framework that needs widespread, voluntary adoption in a community with significant variation in technical capacity.
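As an illustration of the pattern described above, the following is a minimal sketch of a federated search client. It assumes each provider publishes an OpenSearch-style URL template with `{searchTerms}` and optional `{count?}` or `{startIndex?}` placeholders. The provider names, endpoints, and the `fetch` callable are hypothetical, and response parsing is stubbed out so that the sketch needs no network access.

```python
import re
from urllib.parse import quote

# Hypothetical providers: these URL templates are illustrative, not real endpoints.
PROVIDERS = {
    "center-a": "https://a.example.org/search?q={searchTerms}&n={count?}",
    "center-b": "https://b.example.org/os?keyword={searchTerms}&start={startIndex?}",
}

def fill_template(template, params):
    """Substitute OpenSearch-style {name} and {name?} placeholders.

    Optional parameters (trailing '?') that are not supplied become empty
    strings; a missing required parameter raises KeyError.
    """
    def substitute(match):
        name, optional = match.group(1), match.group(2) == "?"
        if name in params:
            return quote(str(params[name]), safe="")
        if optional:
            return ""
        raise KeyError(name)
    return re.sub(r"\{([A-Za-z:]+)(\??)\}", substitute, template)

def federated_search(terms, fetch, count=10):
    """Query every provider and merge the results into one list.

    `fetch` is any callable mapping a URL to a list of result dicts with
    'title' and 'link' keys; it is injected so the sketch needs no network.
    """
    merged, seen = [], set()
    for name, template in PROVIDERS.items():
        url = fill_template(template, {"searchTerms": terms, "count": count})
        for item in fetch(url):
            if item["link"] not in seen:  # drop duplicates across providers
                seen.add(item["link"])
                merged.append({**item, "provider": name})
    return merged

def fake_fetch(url):
    # Stand-in for an HTTP request plus Atom/RSS response parsing.
    if "a.example.org" in url:
        return [{"title": "Sea ice extent", "link": "https://a.example.org/d/1"}]
    return [{"title": "Sea ice extent", "link": "https://a.example.org/d/1"},
            {"title": "Ice thickness", "link": "https://b.example.org/d/7"}]

results = federated_search("sea ice", fake_fetch)
```

A real client would replace `fake_fetch` with an HTTP call and feed parser, and would likely rank the merged results rather than simply concatenating them.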

Datacasting Frameworks

Federated search is a pull model for locating data of interest: clients must know where potential data are likely to be found and must explicitly enable search of those sites. Another model for advertising information, in common use outside of science, is publish and subscribe. In this model, providers simply advertise the existence of their data on their website, using standard protocols such as RSS and Atom. The resulting streams of data advertisements (i.e., casts or feeds) are discoverable as soon as they are published, both by people who have subscribed and by web crawlers such as those used by Google and Yahoo. Moreover, if the feeds are formatted using the same protocols as are used by OpenSearch, then systems typically called aggregators can find them and include them in federated search results. In this manner, investigators without significant technical support can make their data discoverable through a multitude of client applications without effort beyond that of generating a simple file and making it available on their web site.

In addition to advertising data collections as a whole, similar advertising mechanisms can be used to publish the existence of new data within a collection. In this manner, subscribers are alerted to the existence of new data (for example, within a time series) that meets their criteria. It is even possible to obtain the data automatically upon receipt of the advertisement.
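To make the "simple file" concrete, here is a minimal sketch of a data cast written as an Atom feed using only the Python standard library. It is not the Discovery Cluster's cast specification: the choice of elements is an assumption, and the feed identifier, data set record, and URLs are hypothetical. The `rel="enclosure"` link relation is the standard Atom way to point at a downloadable resource, which is what would let a subscriber fetch data automatically.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)

def _add(parent, tag, text=None, **attrs):
    """Append a namespaced Atom child element and return it."""
    child = ET.SubElement(parent, f"{{{ATOM_NS}}}{tag}", attrs)
    if text is not None:
        child.text = text
    return child

def build_data_cast(feed_id, title, updated, datasets):
    """Return an Atom document (as a string) advertising the given data sets."""
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    _add(feed, "id", feed_id)
    _add(feed, "title", title)
    _add(feed, "updated", updated)
    for ds in datasets:
        entry = _add(feed, "entry")
        _add(entry, "id", ds["url"])
        _add(entry, "title", ds["title"])
        _add(entry, "updated", ds["updated"])
        _add(entry, "summary", ds["summary"])
        # rel="enclosure" marks a link that a client may download automatically.
        _add(entry, "link", rel="enclosure", href=ds["url"])
    return ET.tostring(feed, encoding="unicode")

def enclosure_links(atom_text):
    """Subscriber side: list the download links advertised in a cast."""
    root = ET.fromstring(atom_text)
    return [link.get("href")
            for link in root.iter(f"{{{ATOM_NS}}}link")
            if link.get("rel") == "enclosure"]

# Hypothetical data set record; the URL and values are illustrative only.
cast = build_data_cast(
    feed_id="https://example.org/casts/glacier",
    title="Example glacier data cast",
    updated="2011-10-11T00:00:00Z",
    datasets=[{
        "title": "Glacier front positions, 2010",
        "summary": "Digitized terminus positions for one field season.",
        "updated": "2011-10-11T00:00:00Z",
        "url": "ftp://example.org/data/glacier_2010.csv",
    }],
)
```

Calling `enclosure_links(cast)` returns the advertised download locations, which a subscribing client could then retrieve.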

Service Casting Framework

Discovery and use of Web services for querying, accessing, and processing science datasets is hampered by the lack of open and web-scalable services and data registries that provide rich and complete search for all available services, datasets, interesting geophysical events, and data granules relevant to studying those events. Centralized registries, such as NASA’s Global Change Master Directory (GCMD), the Earth Observing System Clearinghouse (ECHO), and the Global Earth Observation System of Systems (GEOSS) Registry, are useful for basic metadata search but have issues stemming from the registry paradigm: complex interfaces for registering services and datasets, hampering adoption; difficulty evolving metadata fields (registered information) rapidly to suit changing requirements; search results that are difficult to rank by relevance; and even competition with each other, fragmenting the user base. As a result, users must sometimes search multiple registries to find all available web services, with the risk of still missing those that are not registered in any registry.

The purpose of a services and dataset registry is to provide “discovery” (search) of the services and datasets. However, as Google search has amply demonstrated for information on the Web, one need not explicitly “push” metadata into a centralized registry in order to enable discovery. Instead, a service provider can simply “advertise” their service or dataset by publishing structured metadata of their choice on the web as a syndicated (broadcast) feed. Search providers can then “discover” the data or service and its metadata by crawling, “pull” the metadata into an indexed aggregation, and compete to provide rich search services (e.g. text keyword, hierarchical taxonomy, or semantic faceted search). Central registries don’t scale well in size or flexibility because they require users to “push” appropriate (“asked for”) information into multiple registries. An “open publish and crawl” (pull) system is inherently more scalable, flexible, and extensible.

Service casting, like data casting, provides a framework that allows service providers to publish and advertise their services using a community-derived definition implemented in the standard RSS and Atom protocols. The public availability of these published service "feeds" allows them to be discovered and/or subscribed to without any specialized applications or tools. The service providers remain in control of their published service descriptions and can easily keep them up-to-date and complete. This standards-based approach to service definitions supports the development of specialized tools for feed authoring and aggregation that will be consistent for all service casts made available through the community-embraced convention. A prototype project that provides information and tools for service casting is available at http://ws3dev.itsc.uah.edu/infocasting/.
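The "publish and crawl" idea can be sketched as a toy aggregator that pulls entries from provider feeds into a keyword index. This is an illustrative assumption about how an aggregator might work, not a description of any existing one: the entry fields (`id`, `title`, `summary`) and the service names are hypothetical, and feed retrieval and parsing are left out.

```python
from collections import defaultdict

class Aggregator:
    """Toy "publish and crawl" index: it pulls advertised entries, never accepts pushes."""

    def __init__(self):
        self._entries = {}               # entry id -> entry dict
        self._index = defaultdict(set)   # keyword -> set of entry ids

    def crawl(self, feed_entries):
        """Index entries pulled from one provider's feed; re-crawling refreshes them."""
        for entry in feed_entries:
            self._remove(entry["id"])
            self._entries[entry["id"]] = entry
            for word in self._words(entry):
                self._index[word].add(entry["id"])

    def search(self, query):
        """Return entries containing every query word, sorted by id."""
        words = query.lower().split()
        if not words:
            return []
        ids = set.intersection(*(self._index.get(w, set()) for w in words))
        return [self._entries[i] for i in sorted(ids)]

    def _remove(self, entry_id):
        old = self._entries.pop(entry_id, None)
        if old:
            for word in self._words(old):
                self._index[word].discard(entry_id)

    @staticmethod
    def _words(entry):
        return set((entry["title"] + " " + entry["summary"]).lower().split())

agg = Aggregator()
agg.crawl([
    {"id": "svc-1", "title": "Subsetting service", "summary": "OPeNDAP access to sea ice grids"},
    {"id": "svc-2", "title": "Reprojection service", "summary": "Regrids sea surface temperature"},
])
# The provider later edits its own description; the next crawl picks up the change.
agg.crawl([{"id": "svc-2", "title": "Reprojection service", "summary": "Regrids ocean color"}])
hits = [e["id"] for e in agg.search("sea ice")]
```

Because the provider owns the published description, an update requires no registry interaction: the aggregator simply sees the new text on its next crawl.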

Beyond Discovery - Integrating Frameworks Together

One problem that is largely unsolved is that of linking data with the services that operate on them. An exemplar of this issue is the services offered through the Open-source Project for a Network Data Access Protocol (OPeNDAP). OPeNDAP is used widely among ESIP Federation partners to provide remote data access. It has several attractive features, such as the ability to subset data remotely and several possible "response types" that provide data and metadata in a variety of formats....
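As a small illustration of the remote subsetting and response types mentioned above, the sketch below builds a DAP2-style request URL, in which a suffix selects the response type and a constraint expression of the form `variable[start:stride:stop]` selects a hyperslab. The server URL and variable name are hypothetical, and the helper `dap_url` is introduced here for illustration only.

```python
def dap_url(dataset_url, response, variable=None, slices=()):
    """Build a DAP2-style request URL.

    response: suffix selecting the response type, e.g. "dds", "das", "dods", "ascii".
    slices:   one (start, stride, stop) tuple per array dimension; stop is inclusive.
    """
    url = f"{dataset_url}.{response}"
    if variable is not None:
        hyperslab = "".join(f"[{a}:{b}:{c}]" for a, b, c in slices)
        url += f"?{variable}{hyperslab}"
    return url

# Hypothetical data set and variable: first time step, a small lat/lon window.
subset = dap_url(
    "https://example.org/opendap/sst.nc",
    "ascii",
    variable="sst",
    slices=[(0, 1, 0), (10, 1, 20), (30, 1, 40)],
)
metadata = dap_url("https://example.org/opendap/sst.nc", "dds")
```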

Example Scenario - Ruth

Following is a generalized scenario that Discovery cluster members are working to implement that illustrates how all these technologies unite to enable data and service discovery, publishing, and use.

An investigator starting a new project uses their favorite portal to obtain a list of data and services meeting their spatial, temporal, and free text criteria. Since the portal conducts a federated search across the wide variety of aggregators and data centers containing data that is typically of interest to the investigator, the results contain any advertised data set meeting their criteria, no matter whether the data is held by an individual investigator or by any one of the multitude of data providers that exist world-wide.
The investigator examines several data sets and subscribes to be notified whenever new services or service updates for those data sets become available.
The investigator also subscribes to be notified when a new data set meeting their query criteria is announced or when a data set already on their subscription list is updated.
The investigator finds results from two locations that are of interest. They retrieve the data from the first location using the link to the data set provided.
While perusing the service descriptions available for one of the data sets, the investigator realizes that a granule (e.g., file) level OpenSearch service produces KML output compatible with their favorite GIS analysis tool, so they use that tool to browse the latest data, decide that the data is adequate, and download it all from within their analysis environment using a well-known protocol such as OPeNDAP or Web Coverage Service.
One morning, while perusing their feed reader, the investigator becomes aware that a new data set meeting their criteria has been published. They examine that data set and add it to their data set subscriptions.
The investigator examines the services available for the new data set and is disappointed to find that no granule OpenSearch or equivalent service is available but that one is planned. They subscribe to receive service updates for the data set.
Eventually the service they were interested in is released, and they are automatically notified of that event (when the service cast is updated).
As a part of their work, the investigator generates a new data set that they decide should be published. They choose to use an open source web-based data cast creator to announce availability of the data via ftp from their web site.
A number of special purpose aggregation crawlers (for example, one aggregating all geologic data held anywhere) discover the new data cast while searching the web and add the data cast to their aggregations. Other investigators who subscribed to those aggregators are notified of the new data and may contact the investigator to see if they can use it for their research.
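The notification steps in the scenario above amount to matching newly advertised entries against each subscriber's spatial, temporal, and free text criteria. The following is a minimal sketch of that matching, assuming each cast entry carries a bounding box, a time range, and a title. The field names, subscriber name, and data set records are hypothetical, and bounding boxes that cross the antimeridian are not handled.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Criteria:
    bbox: tuple   # (west, south, east, north) in degrees
    start: date
    end: date
    text: str

def overlaps(a, b):
    """True if two (west, south, east, north) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def matches(criteria, entry):
    """Check one cast entry against spatial, temporal, and text criteria."""
    return (overlaps(criteria.bbox, entry["bbox"])
            and entry["start"] <= criteria.end
            and criteria.start <= entry["end"]
            and criteria.text.lower() in entry["title"].lower())

def notify(subscriptions, new_entries):
    """Map each subscriber to the titles of newly advertised entries that match."""
    return {name: [e["title"] for e in new_entries if matches(c, e)]
            for name, c in subscriptions.items()}

# Hypothetical subscriber and newly advertised data sets.
subscriptions = {
    "investigator": Criteria((-60, 60, -20, 85), date(2010, 1, 1), date(2011, 12, 31), "ice"),
}
new_entries = [
    {"title": "Greenland ice velocity", "bbox": (-75, 59, -10, 84),
     "start": date(2009, 6, 1), "end": date(2010, 6, 1)},
    {"title": "Antarctic ice shelves", "bbox": (-180, -90, 180, -60),
     "start": date(2010, 1, 1), "end": date(2011, 1, 1)},
    {"title": "Greenland snow depth", "bbox": (-75, 59, -10, 84),
     "start": date(2010, 1, 1), "end": date(2011, 1, 1)},
]
alerts = notify(subscriptions, new_entries)
```

Only the first entry satisfies all three criteria: the second lies outside the bounding box, and the third does not match the free text term.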

Governance - Hook

The organization of the Discovery Cluster is volunteer-based and consensus-driven. Further, in keeping with our lightweight approach, we rely on open source software. Correspondingly, the expanding amount of open source software and licenses, software management models and providers, IP issues, and community interaction models highlights a need for an "open source understanding framework". Members of the Cluster have been developing this framework and have been presenting it to NASA, ESIP, and NSF's NEON project. We have suggested a community-governed model for NASA to follow for consuming and producing open source software, have identified strategic dimensions and tradeoffs, and have developed an initial set of recommendations to move forward.

In an interoperability regime involving multiple government agencies at all levels, academia, the commercial world and even citizen scientists, it is difficult for any single organization or group to set forth interoperability standards for all to follow. Instead, the EarthCube governance should allow for emergent governance structures that can leverage the efforts of both volunteer and funded participants. The cluster formation within the ESIP federation provides an example of how this can work.

With multiple data centers from a variety of different organizations now involved with the Discovery Cluster activities, resolving issues as a virtual organization becomes increasingly difficult. The Cluster routinely collaborates to resolve issues related to interoperability, distributed services, and adoption of different open standards. While ESIP is not a standards body, a structured process is still needed to coordinate the various viewpoints, decisions, and interoperability issues. Therefore, developing an agreed-upon process, and then following it, becomes critical to the ongoing collaboration of the various organizations that compose the Cluster.

Process

Given the agile, grassroots nature of the Discovery Cluster’s lightweight approach to Earth science data discovery, we needed a governance process that was also lightweight and agile. In 2010, the Cluster adopted a governance process borrowing useful ideas from the Open Provenance Model governance process, which had similar lightweight and agile needs. The Cluster’s governance process encompasses the following steps:

Submission of new proposals
Forum to review proposals
Author revision based on feedback
Voting on change proposals
Ratification or rejection by editors

To maintain an open community process, all steps are posted to the mailing list and/or wiki.
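One way to picture the process is as a small state machine. The sketch below uses the governance step names that this paper lists for a change proposal (submitted, proposal review, revision, vote, final review, ratified, rejected). The allowed transitions between steps are an assumption made for illustration, in particular the loop from revision back to review, and are not a statement of the Cluster's actual rules.

```python
# Allowed moves between governance steps. The step names come from this paper;
# the transitions themselves are assumed for illustration.
TRANSITIONS = {
    "submitted": {"proposal review"},
    "proposal review": {"revision", "vote"},
    "revision": {"proposal review", "vote"},
    "vote": {"final review"},
    "final review": {"ratified", "rejected"},
    "ratified": set(),
    "rejected": set(),
}

def advance(current, target):
    """Move a proposal to `target`, refusing steps the process does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    return target

def run(steps, start="submitted"):
    """Apply a sequence of steps and return the full history of states."""
    state, history = start, [start]
    for step in steps:
        state = advance(state, step)
        history.append(state)
    return history

history = run(["proposal review", "revision", "vote", "final review", "ratified"])
```

Recording the history of states, rather than only the current one, mirrors the requirement that every step be posted to the mailing list or wiki.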

Discovery Change Proposal (DCP)

All changes to the Discovery specifications must go through the governance process, starting with a Discovery Change Proposal (DCP). Each DCP includes information that provides context for the proposal, such as:

A unique identifier. An incrementing number is commonly used in these types of community-based processes.
Timestamp of when proposal was submitted.
Proposal review period (implies a deadline)
Current governance step (submitted, proposal review, revision, vote, final review, ratified, rejected)
Background context
Problem addressed
Proposed solution
Rationale for the solution
Validation for specification (where appropriate, e.g., an XML Schema validator)
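The fields listed above map naturally onto a simple record. The following sketch is one hypothetical representation: the class name, the "DCP-n" identifier format, and the 14-day default review period are assumptions made for illustration, not part of the Cluster's process.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from itertools import count

_next_number = count(1)  # incrementing identifiers: 1, 2, 3, ...

@dataclass
class DiscoveryChangeProposal:
    background: str
    problem: str
    solution: str
    rationale: str
    validation: str = ""     # optional, e.g. a schema check
    review_days: int = 14    # assumed default review period
    step: str = "submitted"  # current governance step
    submitted: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    number: int = field(default_factory=lambda: next(_next_number))

    @property
    def identifier(self):
        return f"DCP-{self.number}"

    @property
    def review_deadline(self):
        """The review period implies a deadline."""
        return self.submitted + timedelta(days=self.review_days)

    def missing_fields(self):
        """Names of required text fields that were left blank."""
        required = ("background", "problem", "solution", "rationale")
        return [name for name in required if not getattr(self, name).strip()]

# Hypothetical proposals with a fixed timestamp so the example is repeatable.
when = datetime(2011, 10, 11, tzinfo=timezone.utc)
first = DiscoveryChangeProposal(
    background="Casts lack a time range.", problem="Temporal search is unreliable.",
    solution="Add start and end elements.", rationale="Matches OpenSearch usage.",
    submitted=when,
)
second = DiscoveryChangeProposal(
    background="Feeds vary.", problem="Clients break.", solution="Publish a schema.",
    rationale=" ", submitted=when,
)
```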

Looking to the future - Chris Lynnes

The preceding text demonstrates how a lightweight standard or convention can nonetheless enable significant interoperability with respect to discovering data and services, and furthermore, how similar, interlocking conventions can provide cross-cutting interoperability, in this case between services and data. However, these are not the only Earth science entities that we should like to encompass in our drive to make systems "interworkable". Data and services (or tools) can be combined in sequences to form scientific workflows. The analysis results from executing these workflows may also be thought of in a fashion similar to data. And the results themselves may be aggregated into an experiment, in much the same way that different model runs are aggregated into an ensemble. Many of the key discovery attributes of workflows, results, and experiments can be inherited from the data and service building blocks from which they are made. As a result, it is not too ambitious to hope that the entire "information stack", from data and services up through workflows, results, and experiments, can be interoperable (or interworkable) both horizontally (data with data, result with result) and vertically (data with tool with workflow with result with experiment). Such an interoperability framework would convey the key advantage of presenting everything in the proper context: a given result could be traced back down through the analysis workflow to the tools/services and data that went into the result. This rich context would be further enhanced by supporting some basic social networking technology, allowing researchers to annotate any level of the information stack (from data/service up through experiment) with contextual knowledge.
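The idea of tracing a result back down the information stack, and of inheriting discovery attributes from the building blocks, can be sketched with a small dependency graph. Everything here is hypothetical: the item names, the "derived from" relation, and the keyword sets are invented for illustration and do not describe an existing system.

```python
# Each item lists what it was directly derived from. All names are hypothetical.
DERIVED_FROM = {
    "experiment:ensemble-1": ["result:run-a", "result:run-b"],
    "result:run-a": ["workflow:trend-analysis"],
    "result:run-b": ["workflow:trend-analysis"],
    "workflow:trend-analysis": ["service:subsetter", "service:regridder", "data:sst-monthly"],
    "service:subsetter": [],
    "service:regridder": [],
    "data:sst-monthly": [],
}

def trace(item, graph=DERIVED_FROM):
    """Return everything `item` depends on, nearest first, each listed once."""
    seen, order, queue = {item}, [], [item]
    while queue:
        current = queue.pop(0)
        for parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

def inherited_keywords(item, keywords, graph=DERIVED_FROM):
    """Discovery attributes of an item: its own plus those of everything beneath it."""
    found = set(keywords.get(item, ()))
    for dependency in trace(item, graph):
        found |= set(keywords.get(dependency, ()))
    return sorted(found)

KEYWORDS = {
    "data:sst-monthly": ["sea surface temperature", "ocean"],
    "service:regridder": ["regridding"],
}
lineage = trace("result:run-a")
attributes = inherited_keywords("result:run-a", KEYWORDS)
```

A search for "ocean" could then surface the result itself, even though that keyword was only ever attached to the underlying data set.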

Such an "Earth Science Collaboratory (ESC)" (Fig. x) has been proposed within the ESIP Federation, with an Earth Science Collaboratory Cluster formed to push the idea forward. The ESC would allow researchers to share not just data, but also tools, services, analysis workflows (i.e., techniques), and results as easily as links are shared today in tools such as Facebook, thus preserving the full context of a given result as well as the contextual knowledge added by the researcher. There are also potential benefits for many other types of user. For instance, science assessment committees would be able to share with each other not only the (usually highly processed) end results and articles but also the input data and tools, greatly increasing transparency of the assessment. Novice graduate students would be able to "follow" more experienced researchers in the field, thus learning how to handle the data properly and avoiding common pitfalls. Educators would be able to put together science stories that trace back to the original data, allowing them to give students exposure to what "real" data look like and how they are eventually processed to yield a compelling story. Users of Decision Support Systems (DSS) would be able to collaborate in real time with the scientists whose research is incorporated into the DSS, providing a valuable bridge over the chasm that often separates research and operations.

Such an Earth Science Collaboratory faces a number of hurdles, both technical and non-technical. However, the NSF EarthCube is aligned along the same axis, and could therefore provide the critical impetus toward realization of the ESC.

Conclusion (All)

Discovery and Earth Science Collaboratory Cluster Participants

Clynnes 10:35, 15 September 2011 (MDT)
Kskuo 13:04, 15 September 2011 (MDT)
Hook 15:37, 15 September 2011 (MDT)
Ctilmes 10:14, 19 September 2011 (MDT)
Ken Keiser (Keiser) 08:57, 20 September 2011 (MDT)
Brianwee 10:43, 20 September 2011 (MDT)
Parsonsm 13:47, 6 October 2011 (MDT)
Bdwilson 14:15, 11 October 2011 (MDT)

Supporting the Whitepaper

JamesGallagher 14:15, 11 October 2011 (MDT)
Rahul Ramachandran
Helen Conover