FROST Architecture Options

Centralized vs. Distributed Search for Datasets

At the dataset level, we are talking about a database on the order of 10^5 entries, which can be searched with adequate performance in a centralized database. (The file level is more like 10^8, which provides the rationale for distributing the search at that level.)

The distributed model is taken to include cases where OpenSearch Description documents are crawled by search engines, as well as cases where multiple sites are searched for dataset-level information.

Options for Centralized Dataset Search

Note that the options below are not mutually exclusive.

Global Change Master Directory

GCMD records a vast number of datasets related to Earth Science and Global Change. In theory, it holds records for all publicly available NASA Earth Science Datasets, plus additional datasets from ESIP members, and it is an ESIP partner itself. GCMD has also shown an ability to map its DIF structure to a variety of output formats. A key question would be the cost of accommodating the OpenSearch Description documents.

A key advantage is the ability to search on time and space constraints at the dataset level.
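As a rough illustration of what a dataset-level search with time and space constraints amounts to, the sketch below filters a small in-memory list of records by bounding-box and date-range overlap. The record fields, dataset names, and coordinates are hypothetical and are not taken from GCMD or Mercury; bounding boxes that cross the antimeridian are deliberately ignored.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetRecord:
    # Hypothetical dataset-level record: a name, a bounding box, a time range.
    name: str
    west: float
    south: float
    east: float
    north: float
    start: date
    end: date


def overlaps(rec, bbox, start, end):
    """True if the record intersects the query box and the query time range."""
    west, south, east, north = bbox
    spatial = (rec.west <= east and rec.east >= west
               and rec.south <= north and rec.north >= south)
    temporal = rec.start <= end and rec.end >= start
    return spatial and temporal


def search(records, bbox, start, end):
    """Return the names of all records matching the constraints."""
    return [r.name for r in records if overlaps(r, bbox, start, end)]


# Made-up records, for illustration only.
RECORDS = [
    DatasetRecord("arctic_sea_ice", -180, 60, 180, 90,
                  date(1979, 1, 1), date(2009, 12, 31)),
    DatasetRecord("tropical_rainfall", -180, -35, 180, 35,
                  date(1998, 1, 1), date(2009, 12, 31)),
]

hits = search(RECORDS, (-50, 70, -20, 85), date(2000, 1, 1), date(2000, 12, 31))
print(hits)
```

At 10^5 records even a linear scan like this is cheap, which is consistent with the case for a centralized dataset-level search.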

Mercury

Mercury is the official(?) directory for ESIP, and so in theory it would make sense as the directory for any FROST implementation within ESIP. Key questions would be:

How much would it cost to accommodate the OpenSearch Description documents (should be small?)
How up to date is the ESIP instance of Mercury kept?

As with GCMD, a key advantage is the ability to search on time and space constraints at the dataset level.

Chaining the Dataset Level to the File Level Search

Return Links to OpenSearch Description Documents

In this option, the directory search provider would include in its response a link to the OpenSearch Description Document for each dataset. For this to work, we would need to establish a convention for where the link appears inside the Atom response (a link element with a rel attribute?) and how it is identified (e.g., rel="OpenSearchDescription"?).

The examples on the OpenSearch website show OSDDs being returned with something like <atom:link rel="search" type="application/opensearchdescription+xml" ...>. I have poked around some, and there is a pretty clear description of how to serve the OSDDs: see OpenSearch Autodiscovery. --Matt Savoie (Savoie) 17:25, 11 September 2009 (EDT)

I strongly favor this option because it allows for the possibility of adding links to other service descriptions too. For example, if a particular data set supported an OGC service, it would be good to include that link too... Though actually I've been thinking of how to provide links to a "service cast" (i.e., contents thereof) -- Rduerr 17:44, 26 August 2009 (EDT)
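A minimal sketch of how a client might pick such links out of a directory response, assuming the convention settles on the OpenSearch autodiscovery form (rel="search" with type "application/opensearchdescription+xml"). The sample feed, its URLs, and the function name are hypothetical; no such convention has been agreed for FROST yet.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"
OSDD_TYPE = "application/opensearchdescription+xml"

# Hypothetical directory response: one dataset entry carrying an OSDD link.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Dataset search results</title>
  <entry>
    <title>Example dataset</title>
    <id>urn:example:dataset-1</id>
    <link rel="alternate" type="text/html"
          href="http://example.org/datasets/1"/>
    <link rel="search" type="application/opensearchdescription+xml"
          href="http://example.org/datasets/1/osdd.xml"/>
  </entry>
</feed>
"""


def osdd_links(atom_xml):
    """Return (entry title, OSDD href) pairs found in an Atom feed."""
    root = ET.fromstring(atom_xml)
    found = []
    for entry in root.findall(ATOM_NS + "entry"):
        title = entry.findtext(ATOM_NS + "title", default="")
        for link in entry.findall(ATOM_NS + "link"):
            # Assumed convention: rel="search" plus the OSDD media type.
            if link.get("rel") == "search" and link.get("type") == OSDD_TYPE:
                found.append((title, link.get("href")))
    return found


print(osdd_links(SAMPLE_FEED))
```

Because the filter keys on rel and type, links to other service descriptions could sit alongside the OSDD link in the same entry without confusing the client.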

Return OpenSearch Description Documents Directly

In this scheme, the directory being searched would return an OpenSearch Description Document directly. Typically, this would be implemented by adding a Template field to the directory in question, and having the directory map other fields to the OpenSearch Description Document elements. GCMD might map <Entry_ID> or <Entry_Title> to <ShortName>, and <Summary> to <Description>, also populating the <Tags> field with GCMD Keywords. (The Template would be added as a Related_URL.)
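The mapping described above could look roughly like the following sketch. The DIF record is represented as a plain dictionary with made-up values, and the function name is hypothetical; this is not GCMD's actual export mechanism. Replacing spaces inside keywords with underscores is an assumption made here because OpenSearch tags are space-delimited single words, and the truncation lengths follow the limits stated in the OpenSearch 1.1 specification.

```python
import xml.etree.ElementTree as ET

OS_NS = "http://a9.com/-/spec/opensearch/1.1/"

# Hypothetical DIF record reduced to the fields discussed in the text.
SAMPLE_DIF = {
    "Entry_ID": "EXAMPLE_DIF_01",
    "Entry_Title": "Example Sea Ice Dataset",
    "Summary": "Illustrative summary text for a made-up dataset.",
    "Keywords": ["EARTH SCIENCE", "CRYOSPHERE", "SEA ICE"],
    # The Template is assumed to be stored as a Related_URL.
    "Related_URL": "http://example.org/search?q={searchTerms}",
}


def dif_to_osdd(dif):
    """Build an OpenSearch Description Document element from a DIF-like dict."""
    root = ET.Element("OpenSearchDescription", xmlns=OS_NS)

    def add(tag, text):
        child = ET.SubElement(root, tag)
        child.text = text

    # OpenSearch 1.1 limits: ShortName 16 chars, Description 1024, Tags 256.
    add("ShortName", dif["Entry_ID"][:16])
    add("Description", dif["Summary"][:1024])
    tags = " ".join(k.replace(" ", "_") for k in dif["Keywords"])
    add("Tags", tags[:256])
    ET.SubElement(root, "Url", type="application/atom+xml",
                  template=dif["Related_URL"])
    return root


osdd = dif_to_osdd(SAMPLE_DIF)
print(ET.tostring(osdd, encoding="unicode"))
```

Real Entry_ID or Entry_Title values will often exceed 16 characters, so a production mapping would need a better rule than plain truncation for ShortName.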

Options for Distributed Dataset Search

Servicecasting

Hook or Brian: do you have an idea how this might work?

Search Engine Tag Tracers

Analogous to the use of radioactive and fluorescent tracers in medicine, this involves including tags in OpenSearch Description documents that are unlikely in normal usage and unique to FROST. When indexed by major search engines, this would turn up only FROST OpenSearch Description documents. For example, one would add to the <Tags> element the string "FROST_OSDD", which currently returns no matches in Google.

This has the attractive benefit that most users start with these search engines anyway. One downside is that secondary indexers, which are sometimes indexed themselves, may pick up the term and contaminate the results, so a mechanism would be needed to filter them out. A second downside is that this approach would not support spatial or time constraints at the dataset level.
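One possible filtering mechanism, sketched below under stated assumptions: after the search engine returns candidate URLs, the client fetches each one and keeps only documents that parse as an OpenSearch Description Document and carry the tracer in their <Tags> element. Fetching is left out; the function name and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

OS_NS = "{http://a9.com/-/spec/opensearch/1.1/}"
TRACER = "FROST_OSDD"


def is_frost_osdd(text):
    """True only for a well-formed OSDD whose Tags include the tracer."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed XML, e.g. a typical HTML page from an indexer.
        return False
    if root.tag != OS_NS + "OpenSearchDescription":
        return False
    tags = root.findtext(OS_NS + "Tags") or ""
    # Tags are space-delimited, so compare whole tokens.
    return TRACER in tags.split()


# Hypothetical candidates as they might come back from a search engine.
GENUINE = """<OpenSearchDescription
    xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Example</ShortName>
  <Description>Made-up description.</Description>
  <Tags>FROST_OSDD sea_ice</Tags>
</OpenSearchDescription>"""

SECONDARY = "<html><body><p>Pages mentioning FROST_OSDD</p></body></html>"

print(is_frost_osdd(GENUINE), is_frost_osdd(SECONDARY))
```

This check only removes contamination after the fact; it does not change how the search engine ranks or returns the secondary pages.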