How-To Guide for Implementing ESIP Federated Search Servers

From Earth Science Information Partners (ESIP)

Introduction

ESIP Federated Search is a simple framework for doing a federated (distributed) query among participating members for Earth science data. It is based on the OpenSearch convention for distributed searches, which centers around OpenSearch Description Documents. These XML documents include a template that shows how to construct a URL in order to execute a query against a particular search engine.

The ESIP Federated Search includes certain conventions to support a two-step dataset/granule (file) query. In the first step, a keyword query (and sometimes space-time criteria) is issued for datasets. In the second step, each selected dataset is queried for granules matching space-time (and possibly keyword) criteria.

[Figure: Esip 2step.png, the two-step dataset/granule search]

The key element linking the two steps is an OpenSearch Description document for each dataset which describes how to do the granule (file) search for that dataset.

What Do I Need for an ESIP Federated Search Server?

In a nutshell, you need four things:

  1. A dataset search engine that supports at least keyword (free-text) search, and optionally space-time constraints
  2. An OpenSearch Description Document describing the dataset search engine
  3. A granule-level search engine supporting space-time query
  4. An OpenSearch Description Document for each dataset that describes the granule-level search template for that dataset

Let's look at each of these in detail.

Dataset Search Engine

The Dataset Search Engine should return an Atom document with the dataset results.

<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:georss="http://www.georss.org/georss" 
            xmlns:geo="http://a9.com/-/opensearch/extensions/geo/1.0/" xmlns:time="http://a9.com/-/opensearch/extensions/time/1.0/">
<author><name>GES DISC</name><email>mirador-disc@listserv.gsfc.nasa.gov</email></author>
<opensearch:itemsPerPage/><updated>2009-11-18T18:30:02Z</updated>
<title>Mirador collection results for Monoxide</title>
<id>http://mirador.gsfc.nasa.gov/cgi-bin/mirador/collectionlist.pl</id>
<subtitle type="html">Monoxide (distributed by GES DISC)
        </subtitle>
<link rel="self" href="http://mirador.gsfc.nasa.gov/cgi-bin/mirador/collectionlist.pl"/>
<link rel="http://esipfed.org/ns/fedsearch/1.0/search#" href="http://mirador.gsfc.nasa.gov/cgi-bin/mirador/granlist.pl"/>
<entry>
<id>http://mirador.gsfc.nasa.gov/OpenSearch/mirador_opensearch_ML2CO.002.xml</id>
<updated>2009-11-18T18:30:02Z</updated>
<author><name>GES DISC</name><email>mirador-disc@listserv.gsfc.nasa.gov</email></author>
<title>MLS/Aura L2 Carbon Monoxide (CO) Mixing Ratio (ML2CO) </title>
<link href="http://mirador.gsfc.nasa.gov/cgi-bin/mirador/granlist.pl?page=1&amp;dataSet=ML2CO&amp;version=002&amp;allversion=002&amp;keyword=Monoxide&amp;pointLocation=(-90,-180),(90,180)&amp;location=(-90,-180),(90,180)&amp;searchType=Location&amp;event=&amp;startTime=2009-10-10&amp;endTime=2009-10-11 23:59:59&amp;search=&amp;CGISESSID=f408c488319554acd03731525a55f5a8&amp;nr=4&amp;temporalres=1%20Day(s)&amp;prodpg=http://mirador.gsfc.nasa.gov/collections/ML2CO__002.shtml&amp;longname=MLS/Aura L2 Carbon Monoxide (CO) Mixing Ratio&amp;granulePresentation=ungrouped" rel="http://esipfed.org/ns/fedsearch/1.0/data#"/>
<time:start>2004-08-08</time:start>
<time:end>2009-12-14</time:end>
<summary type="html">Dataset:ML2CO.002(1)</summary>
<link rel="search" type="application/opensearchdescription+xml" title="ML2CO.002" href="http://mirador.gsfc.nasa.gov/OpenSearch/mirador_opensearch_ML2CO.002.xml"/>
<link rel="enclosure" type="text/html" href="http://mirador.gsfc.nasa.gov/collections/ML2CO__002.shtml" title="/OpenSearch/mirador_opensearch_ML2CO.002.xml info"/>
</entry>
</feed>

Note especially the <link> entry with type "application/opensearchdescription+xml". This is the URL to the OpenSearch Description Document describing how to construct a URL to search for ML2CO.002 data.
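
As an illustration of how a client might pick out those links, here is a minimal Python sketch using only the standard library. The function name dataset_osdd_links and the trimmed SAMPLE_FEED are assumptions made for this example; they are not part of the ESIP convention or of Mirador.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
OSDD_TYPE = "application/opensearchdescription+xml"


def dataset_osdd_links(atom_text):
    """Return (dataset title, OSDD URL) pairs found in a dataset-level Atom feed."""
    root = ET.fromstring(atom_text)
    results = []
    for entry in root.iter(ATOM + "entry"):
        title = (entry.findtext(ATOM + "title") or "").strip()
        for link in entry.findall(ATOM + "link"):
            # The granule-level description document is the rel="search" link
            # whose type is the OpenSearch description media type.
            if link.get("rel") == "search" and link.get("type") == OSDD_TYPE:
                results.append((title, link.get("href")))
    return results


# Trimmed-down version of the response above (hypothetical test input).
SAMPLE_FEED = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Mirador collection results for Monoxide</title>
  <entry>
    <title>MLS/Aura L2 Carbon Monoxide (CO) Mixing Ratio (ML2CO) </title>
    <link rel="search" type="application/opensearchdescription+xml" title="ML2CO.002"
          href="http://mirador.gsfc.nasa.gov/OpenSearch/mirador_opensearch_ML2CO.002.xml"/>
  </entry>
</feed>"""

links = dataset_osdd_links(SAMPLE_FEED)
for title, href in links:
    print(title, "->", href)
```

Each URL returned this way is the starting point for the second (granule) step for that dataset.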

Dataset Search OpenSearch Description Document

Once you have a dataset search that can return Atom results of the kind above, the next step is to create an OpenSearch Description Document for the dataset-level search. Here is an example:

<?xml version="1.0" encoding="UTF-8"?>
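<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/"
                       xmlns:geo="http://a9.com/-/opensearch/extensions/geo/1.0/"
                       xmlns:time="http://a9.com/-/opensearch/extensions/time/1.0/">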
  <ShortName>Mirador Dataset Search</ShortName>
  <Description>Use Mirador Dataset Search to obtain a list of Earth Science Data Sets</Description>
  <Tags>Mirador Dataset Search</Tags>
  <Contact>mirador-disc@listserv.gsfc.nasa.gov</Contact>
  <Url type="application/atom+xml" 
       template="http://mirador.gsfc.nasa.gov/cgi-bin/mirador/collectionlist.pl?keyword={searchTerms}&amp;page=1&amp;count={count}&amp;osLocation={geo:box}&amp;startTime={time:start}&amp;endTime={time:end}&amp;format=atom"/>
</OpenSearchDescription>

The key here is the <Url> tag, which includes a template showing how to form a URL to execute a search for datasets. Simply replace the placeholders with the search criteria and fetch the resulting URL. The search engine will return an Atom document with links to the granule-level OpenSearch Description Documents (see above).
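
The placeholder substitution can be sketched in a few lines of Python. This is only an illustration: the helper name fill_template and the search values are assumptions for this example, and the handling of a trailing "?" follows the general OpenSearch convention for optional parameters rather than anything specific to Mirador. The template string is written with plain "&" separators, which is how it arrives once the description document has been read with an XML parser.

```python
import re
from urllib.parse import quote

# Matches {name} and {name?}; a trailing "?" marks an optional parameter.
PLACEHOLDER = re.compile(r"\{([^{}?]+)(\?)?\}")


def fill_template(template, values):
    """Replace each {placeholder} in an OpenSearch URL template with an encoded value."""
    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            # Percent-encode everything that is not an unreserved character.
            return quote(str(values[name]), safe="")
        if optional:
            return ""  # optional parameters may be left empty
        raise KeyError(f"missing required search parameter: {name}")
    return PLACEHOLDER.sub(substitute, template)


TEMPLATE = (
    "http://mirador.gsfc.nasa.gov/cgi-bin/mirador/collectionlist.pl"
    "?keyword={searchTerms}&page=1&count={count}&osLocation={geo:box}"
    "&startTime={time:start}&endTime={time:end}&format=atom"
)

# Hypothetical search criteria chosen for the example.
url = fill_template(TEMPLATE, {
    "searchTerms": "carbon monoxide",
    "count": 10,
    "geo:box": "-180,-90,180,90",
    "time:start": "2009-10-10",
    "time:end": "2009-10-11",
})
print(url)
```

Raising an error for a missing required parameter is a design choice of this sketch; a client could equally decide to send an empty value.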

Granule (File) Search Engine

This is probably the hardest part to implement. The Granule (or File) search engine should support queries based on some sort of dataset identifier, plus time and (if appropriate) space. Note, however, that because of the way the two-step search works, you can use different search engines for different datasets, for example if you have a specialized search algorithm for some data. Also, if your data are global, you need not offer the spatial part of the search. How does the client know which search engine and dataset identifier to use, and whether spatial criteria are allowed? Easy: it's in the OpenSearch Description Document template described in the following section.
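
To make the space-time matching concrete, here is a rough Python sketch of the overlap tests a granule search engine might apply. It assumes the geo:box value is ordered west,south,east,north in decimal degrees, it ignores boxes that cross the antimeridian, and the granule record layout and coordinates are hypothetical; a real engine would normally push these tests into its database or index.

```python
from datetime import datetime


def parse_box(text):
    """Parse a geo:box value, assumed to be 'west,south,east,north' in decimal degrees."""
    west, south, east, north = (float(part) for part in text.split(","))
    return west, south, east, north


def boxes_intersect(a, b):
    """True if two (west, south, east, north) boxes overlap (no antimeridian handling)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def times_overlap(start_a, end_a, start_b, end_b):
    """True if the closed intervals [start_a, end_a] and [start_b, end_b] overlap."""
    return start_a <= end_b and start_b <= end_a


def granule_matches(granule, query_box, query_start, query_end):
    """Apply the time test first, then the spatial test when both sides have a box."""
    if not times_overlap(granule["start"], granule["end"], query_start, query_end):
        return False
    # Global datasets may carry no box; treat them as matching any area.
    if granule.get("box") is None or query_box is None:
        return True
    return boxes_intersect(granule["box"], query_box)


# Hypothetical granule record; times and coordinates are made up for the example.
granule = {
    "start": datetime(2009, 10, 10, 0, 59, 24),
    "end": datetime(2009, 10, 10, 1, 5, 24),
    "box": (24.0, 3.0, 48.0, 29.0),
}
query_start, query_end = datetime(2009, 10, 10), datetime(2009, 10, 11)

global_hit = granule_matches(granule, parse_box("-180,-90,180,90"), query_start, query_end)
far_away_hit = granule_matches(granule, parse_box("100,40,120,60"), query_start, query_end)
print(global_hit, far_away_hit)
```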

The response to the query should again be in Atom form. Here is a partial example:

<entry><id>http://aurapar2u.ecs.nasa.gov/airspar1/Aqua_AIRS_Level2/AIRS2RET.005/2009/283/AIRS.2009.10.10.010.L2.RetStd_IR.v5.0.14.0.G09284012834.hdf</id>
<title>AIRS.2009.10.10.010.L2.RetStd_IR.v5.0.14.0.G09284012834.hdf</title>
<updated>2009-10-10T00:59:24Z</updated>
<link href="http://aurapar2u.ecs.nasa.gov/airspar1/Aqua_AIRS_Level2/AIRS2RET.005/2009/283/AIRS.2009.10.10.010.L2.RetStd_IR.v5.0.14.0.G09284012834.hdf" 
  length="2254522" rel="http://esipfed.org/ns/fedsearch/1.0/data#"/>
<link rel="http://esipfed.org/ns/fedsearch/1.0/metadata#" type="text/xml" title="Metadata" length="10000" 
  href="http://aurapar2u.ecs.nasa.gov/airspar1/Aqua_AIRS_Level2/AIRS2RET.005/2009/283/AIRS.2009.10.10.010.L2.RetStd_IR.v5.0.14.0.G09284012834.hdf"/>
<link rel="http://esipfed.org/ns/fedsearch/1.0/browse#" type="image/jpeg" title="Browse Image" length="10000" 
  href="http://disc.gsfc.nasa.gov/daac-bin/airs/displayPreviewImage.py?filename=AIRS.2009.10.10.010.L2.RetStd_IR.v5.0.14.0.G09284012834.hdf"/>
<link rel="http://esipfed.org/ns/fedsearch/1.0/netcdf#" type="application/netcdf" title="Same File in NetCDF Format" length="10000000" 
  href="http://aurapar2u.ecs.nasa.gov/airspar1/daac-bin/OTF/HTTP_services.cgi?FILENAME=%2Fdata%2Fs4pa%2FAqua_AIRS_Level2%2FAIRS2RET.005%2F2009%2F283%2FAIRS.2009.10.10.010.L2.RetStd_IR.v5.0.14.0.G09284012834.hdf&amp;LABEL=AIRS.2009.10.10.010.L2.RetStd_IR.v5.0.14.0.G09284012834.nc&amp;SHORTNAME=AIRS2RET&amp;SERVICE=NetCDF&amp;VERSION=1.02"/>
<georss:box>29.141, 48.0441, 3.8571, 24.5844</georss:box>
<time:start>2009-10-10T00:59:24Z</time:start>
<time:stop>2009-10-10T01:05:24Z</time:stop>

<summary type="html">format:HDF, start:2009-10-10T00:59:24Z, end:2009-10-10T01:05:24Z, size:2254522  (8)</summary>
</entry>

This is a more complicated picture than the dataset-level search. Here are some of the key features to note:

  1. The first <link> element points to the data URL, identified by the rel value of "http://esipfed.org/ns/fedsearch/1.0/data#". These rel values are ESIP conventions for identifying Earth Science file types. This is the one link that is required.
  2. The second <link> element points to an external metadata file URL. (At this point, there is no standard for the metadata format.)
  3. The third <link> element points to a Browse image for the data. Note that this image is actually served on the fly, though it could be a static file.
  4. The fourth <link> element points to a netCDF conversion service for that file. This aspect of the convention is still in progress and may eventually link up with the ESIP convention for servicecasting.
  5. Note also the <georss:box>, <time:start> and <time:stop> elements, which give the client some key information to present to the user.
  6. The <summary> is somewhat freeform at this point, though a convention may be defined in the future for consistency's sake.
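
The features listed above can be read out of an entry with a short Python sketch like the one below. The function name summarize_granule, the example.org URLs and the sample feed are assumptions made for illustration; only the rel prefix and the element names come from the example response.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "georss": "http://www.georss.org/georss",
    "time": "http://a9.com/-/opensearch/extensions/time/1.0/",
}
REL_PREFIX = "http://esipfed.org/ns/fedsearch/1.0/"


def summarize_granule(entry):
    """Collect the ESIP-typed links and space-time elements of one Atom <entry>."""
    links = {}
    for link in entry.findall("atom:link", NS):
        rel = link.get("rel", "")
        if rel.startswith(REL_PREFIX):
            # ".../fedsearch/1.0/data#" becomes the short key "data".
            links[rel[len(REL_PREFIX):].rstrip("#")] = link.get("href")
    if "data" not in links:
        raise ValueError("granule entry has no data link, which is the one required link")
    return {
        "title": entry.findtext("atom:title", namespaces=NS),
        "links": links,
        "box": entry.findtext("georss:box", namespaces=NS),
        "start": entry.findtext("time:start", namespaces=NS),
        "stop": entry.findtext("time:stop", namespaces=NS),
    }


# Hypothetical feed with shortened placeholder URLs.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:georss="http://www.georss.org/georss"
      xmlns:time="http://a9.com/-/opensearch/extensions/time/1.0/">
  <entry>
    <title>granule.hdf</title>
    <link rel="http://esipfed.org/ns/fedsearch/1.0/data#" href="http://example.org/granule.hdf"/>
    <link rel="http://esipfed.org/ns/fedsearch/1.0/browse#" href="http://example.org/granule.jpg"/>
    <georss:box>29.141, 48.0441, 3.8571, 24.5844</georss:box>
    <time:start>2009-10-10T00:59:24Z</time:start>
    <time:stop>2009-10-10T01:05:24Z</time:stop>
  </entry>
</feed>"""

summary = summarize_granule(ET.fromstring(SAMPLE).find("atom:entry", NS))
print(summary)
```

The box and time values are returned as raw text here, since how a client presents them to the user is left open.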

Granule Engine OpenSearch Description Documents

These documents are the key to making the link between the Dataset-level and Granule-level searches. They describe how to do a space-time query for a particular dataset. Here's an example:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/xsl/frost.xsl"?>
<os:OpenSearchDescription xmlns:os="http://a9.com/-/spec/opensearch/1.1/"
          xmlns:geo="http://a9.com/-/opensearch/extensions/geo/1.0/"
          xmlns:time="http://a9.com/-/opensearch/extensions/time/1.0/">
  <os:ShortName>AIRX2RET.005</os:ShortName>
  <os:Description>Obtain a list of URLs:AIRX2RET.005:</os:Description>
  <os:Tags>AIRX2RET.005</os:Tags>
  <os:Contact>mirador-disc@listserv.gsfc.nasa.gov</os:Contact>
  <os:Url type="application/atom+xml" 
       template="http://mirador.gsfc.nasa.gov/cgi-bin/mirador/granlist.pl?dataSet=AIRX2RET.005&amp;page=1&amp;maxgranules={count}&amp;pointLocation={geo:box}&amp;endTime={time:end}&amp;startTime={time:start}&amp;format=atom"/>
</os:OpenSearchDescription>

Note that this URL template searches only for AIRX2RET.005 and no other dataset. Now, you don't have to have a single static OpenSearch Description document for each dataset (though you can). Some implementations may generate this on the fly, filling in the dataset requested in the appropriate location in the template.
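
Generating the document on the fly could look roughly like the following Python sketch. The function name granule_osdd and the description text are assumptions for this example, and the query parameter names are simply copied from the Mirador template above; another server would use its own.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote
from xml.sax.saxutils import escape, quoteattr

OSDD_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/"
    xmlns:geo="http://a9.com/-/opensearch/extensions/geo/1.0/"
    xmlns:time="http://a9.com/-/opensearch/extensions/time/1.0/">
  <ShortName>{short_name}</ShortName>
  <Description>{description}</Description>
  <Url type="application/atom+xml" template={url_template}/>
</OpenSearchDescription>
"""


def granule_osdd(dataset_id, base_url):
    """Build a granule-level OpenSearch Description Document for one dataset."""
    # Only the dataset identifier is filled in; the search placeholders stay literal.
    url_template = (
        f"{base_url}?dataSet={quote(dataset_id, safe='')}&page=1"
        "&maxgranules={count}&pointLocation={geo:box}"
        "&startTime={time:start}&endTime={time:end}&format=atom"
    )
    return OSDD_TEMPLATE.format(
        short_name=escape(dataset_id),
        description=escape(f"Obtain a list of URLs for {dataset_id}"),
        # quoteattr adds the surrounding quotes and escapes "&" as "&amp;".
        url_template=quoteattr(url_template),
    )


doc = granule_osdd("AIRX2RET.005", "http://mirador.gsfc.nasa.gov/cgi-bin/mirador/granlist.pl")
print(doc)

# Round-trip check: an XML parser hands back the template with plain "&".
OS = "{http://a9.com/-/spec/opensearch/1.1/}"
parsed_template = ET.fromstring(doc.encode("utf-8")).find(OS + "Url").get("template")
```

A server using this approach would return the generated text with the application/opensearchdescription+xml media type at the URL advertised in the dataset-level results.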