Talk:SemanticServicesUseCases

From Earth Science Information Partners (ESIP)

Notes on Chris's search use case 20090109 -- Pfox 14:45, 9 January 2009 (EST)

Aerosol Search Use Case

Locate the access points for aerosol data or services that would be useful and usable for characterizing the extent of the ash cloud plume from the 2 May 2008 Chaitén volcanic eruption. Useful data are those that have information that can be brought to bear on the problem; usable data are those that are available in a form that can be used in my analysis framework.

Target products would include such items as:

   * MODIS L2 Aerosol product, Terra and Aqua
   * MODIS L3 Aerosol product, Terra and Aqua
   * MERIS Aerosol product
   * GOCART Aerosol model
   * CALIPSO Aerosol classification
   * OMI L2 Aerosol index
   * OMI L3 Aerosol index
   * AIRS Brightness Temperature difference (ch ? - ch ??)
   * MISR L2 Aerosol(?)
   * Parasol
   * Experimental CALIPSO aerosol sub-classification
   * GOES Image with analyst's assessment of ash extent
   * Hysplit model run 

Target services could include:

   * Reformatting
   * Subsetting
   * OGC WCS or WMS access
   * OPeNDAP access
   * On-the-fly virtual products
   * On-demand model runs 

Factors that go into the usefulness include:

   * observation vs. model - kinda have this
   * type of measurement (e.g., radiance, aerosol index, AOT/AOD, ...) - start
   * method of measurement (e.g., infrared, visible, UV) - need this
   * source of measurement (instrument or model) - yes (is it connected?)
   * processing algorithm - (production method) - yes
   * units - MUO
   * horizontal resolution - yes
   * vertical resolution - no
   * temporal resolution - yes
   * temporal "alignment" (e.g., averaged vs. synoptic vs. temporal progression, different types of data days) - no (selection based on science validity)
   * data product or service maturity (e.g., operational, validated research, provisional research, experimental) - no 

Factors that go into usability include:

   * data format - hell yes
   * spatial reference system (swath, grid; type of grid) - yes
   * access means (protocol; synchronous vs. asynchronous; machine-accessible or not...) - yes
   * access restrictions - could be
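
Below is a minimal sketch (Python) of how these usability factors might be captured on a catalog record and used as a filter; the CatalogRecord fields, the usable() test, and the example records are illustrative assumptions, not an actual catalog schema.

 from dataclasses import dataclass
 
 @dataclass
 class CatalogRecord:
     """Hypothetical catalog entry carrying the usability factors above."""
     name: str
     data_format: str        # e.g. "HDF-EOS", "netCDF"
     spatial_reference: str  # "swath" or "grid"
     access_protocol: str    # e.g. "OPeNDAP", "WCS", "FTP"
     machine_accessible: bool
     access_restricted: bool
 
 def usable(rec, formats, protocols):
     """Usable = readable format, reachable protocol, no restrictions."""
     return (rec.data_format in formats
             and rec.access_protocol in protocols
             and rec.machine_accessible
             and not rec.access_restricted)
 
 # Made-up example records; only the first passes the OPeNDAP-based filter.
 records = [
     CatalogRecord("MOD04_L2", "HDF-EOS", "swath", "OPeNDAP", True, False),
     CatalogRecord("GOES ash analysis", "GIF", "grid", "HTTP", False, False),
 ]
 print([r.name for r in records
        if usable(r, {"HDF-EOS", "netCDF"}, {"OPeNDAP", "WCS"})])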

What is in the catalog and what is not:

   * observation vs. model - can be inferred
   * type of measurement (e.g., radiance, aerosol index, AOT/AOD, ...) - in
   * method of measurement (e.g., infrared, visible, UV) - inferred from instrument
   * source of measurement (instrument or model) - in
   * processing algorithm - (production method) - not in
   * units - not in
   * horizontal resolution - in
   * temporal resolution - most of the time, or inferred
   * vertical resolution - not so much (it's complicated)
   * temporal "alignment" (e.g., averaged vs. synoptic vs. temporal progression, different types of data days) - not in
   * data product or service maturity (e.g., operational, validated research, provisional research, experimental) - sometimes in (e.g. GCMD - Chris to research)

Example queries:

   * Which data holdings have vertical resolution of aerosols?
   * Which data holdings have horizontal resolution less than n km?
   * Which data retrieve ash aerosols as a separate species?
   * Which data serve as a proxy for volcanic ash?
   * Which data can be accessed directly via a URL (online)?
   * For this dataset, does data actually exist for the time and location of the event?
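
A minimal sketch of how the first two queries might be posed against the catalog with rdflib and SPARQL; the esip_data URI and the property names (measuresSpecies, hasVerticalResolution, hasHorizontalResolutionKm) are placeholder assumptions, since the ontology design is still in progress (see the tasks below).

 from rdflib import Graph
 
 g = Graph()
 g.parse("catalog.owl", format="xml")  # hypothetical file of catalog instances
 
 # "Which data holdings have vertical resolution of aerosols?"
 # Class and property names are placeholders pending the ontology cleanup.
 q1 = """
 PREFIX esip_data: <http://example.org/esip_data#>
 SELECT ?holding WHERE {
     ?holding a esip_data:DataSet ;
              esip_data:measuresSpecies esip_data:Aerosol ;
              esip_data:hasVerticalResolution ?vres .
 }
 """
 
 # "Which data holdings have horizontal resolution less than n km?" (n = 10 here)
 q2 = """
 PREFIX esip_data: <http://example.org/esip_data#>
 SELECT ?holding WHERE {
     ?holding esip_data:hasHorizontalResolutionKm ?hres .
     FILTER (?hres < 10)
 }
 """
 
 for row in g.query(q1):
     print(row.holding)
 for row in g.query(q2):
     print(row.holding)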

Tasks for Rahul:

   * Clean up the data ontology and have a design version (our scope, not the validated version) by Jan 27 - a sketch of what a fragment might look like follows this list
   * Add esip_data as the namespace (for now) for all non-namespaced classes
   * Luis to help with the FGDC/GML/ISO review so we can take out class definitions related to FGDC/GML/ISO
   * Work with Chris on this use case, especially catalog entries to populate instances in the ontology
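
As a concrete starting point, here is a minimal rdflib sketch of what an esip_data fragment with one catalog instance could look like, using the DataSet/DataCollection/DataField classes from the mining notes below; the URIs and the hasDataField property are placeholder assumptions pending the design version.

 from rdflib import Graph, Namespace, Literal
 from rdflib.namespace import RDF, RDFS
 
 # Placeholder namespace URIs; the real ones come out of the Jan 27 design.
 ESIP = Namespace("http://example.org/esip_data#")
 NASA = Namespace("http://example.org/nasa_data#")
 
 g = Graph()
 g.bind("esip_data", ESIP)
 g.bind("nasa_data", NASA)
 
 # Classes named in the use cases below.
 for cls in (ESIP.DataSet, ESIP.DataCollection, ESIP.DataField):
     g.add((cls, RDF.type, RDFS.Class))
 
 # Example instance: MOD021KM is both a DataSet and a DataCollection,
 # and Channel1 is one of its DataFields (cf. the mining notes below).
 g.add((NASA.MOD021KM, RDF.type, ESIP.DataSet))
 g.add((NASA.MOD021KM, RDF.type, ESIP.DataCollection))
 g.add((NASA.MOD021KM_Channel1, RDF.type, ESIP.DataField))
 g.add((NASA.MOD021KM, ESIP.hasDataField, NASA.MOD021KM_Channel1))
 g.add((NASA.MOD021KM_Channel1, RDFS.label, Literal("MODIS Channel 1")))
 
 print(g.serialize(format="turtle"))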

--Clynnes 21:23, 8 January 2009 (EST)

Notes on Mining / Rahul's search use case 20090109 -- Pfox 17:51, 9 January 2009 (EST)

  • the current DM (data mining) system requires the user to construct a workflow with little information about the workflow's requirements or about how its components interface with each other
  • SAM will have information about services and data: what the requirements of the services are in terms of data, which services will work on which data, etc.
  • unsupervised k-means clustering
  • dust storm detection algorithms or methodologies via data mining
  • data products MOD02*KM (e.g. nasa_data:MOD021KM@Channel1) - MOD021KM is a DataSet and a DataCollection; Channel1 is a DataField
    • what properties are important for this: only use data in regions of interest
    • note that "channel" is shorthand for "spectral band" (with associated properties) - we need both concepts, but spectral band is the conceptually sounder one; channel refers to a particular instrument
  • channel subset - need to know whether each Channel is a different DataField and that they are present (e.g., do this with MOD021KM, not MOD02QKM); also need to determine which Channel to use at the particular resolution, often found by iterating over all channels (human-selected for now)
  • reformat to ARFF - need to know the input and output data types (since they depend on which algorithm is used)
  • potential for space-time, QC, or feature-detection sub-selection to reduce/optimize the data (lots of considerations here, but we defer this)
  • normalize - necessary so that a single dimension does not dominate the results
  • clustering - n-dim data, possible need for human input on the number of clusters (facets: format, topology, feature extraction); see the sketch after this list
    • label one of the inputs to indicate whether it takes raw or normalized data, i.e. fitness for use/purpose, with rules - we agreed that this is only needed for inputs (not for outputs), with the understanding that we may need to extend (i.e. add properties) to indicate that (for example) data is normalized.
  • visualize - may need geocoding to overlay results on a map, for example (ARFF does not carry any geocoding)
  • tag clusters - human-in-the-loop step - need input from Rahul to determine what annotations, etc. are required on the services and the data
  • save cluster tag information to apply to other data (granules of the same type - where the characteristics of type need to be determined)
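
A minimal sketch of the normalize-and-cluster core of this workflow, with numpy and scikit-learn standing in for the actual DM system; the radiance array, channel count, cluster count, and tag names are all made-up inputs for illustration.

 import numpy as np
 from sklearn.cluster import KMeans
 
 # Stand-in for a channel subset read from MOD021KM: one row per pixel,
 # one column per selected spectral band (values here are made up).
 rng = np.random.default_rng(0)
 radiances = rng.random((1000, 5))  # 1000 pixels x 5 channels
 
 # Normalize each channel to zero mean / unit variance so that no single
 # dimension dominates the distance metric (the "normalize" step above).
 normalized = (radiances - radiances.mean(axis=0)) / radiances.std(axis=0)
 
 # Unsupervised k-means clustering; the number of clusters is the
 # human-supplied input noted in the "clustering" step above.
 labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(normalized)
 
 # Human-in-the-loop tagging: a cluster-id -> tag map that could be saved
 # and re-applied to other granules of the same type.
 cluster_tags = {0: "dust", 1: "clear", 2: "cloud", 3: "unknown"}
 print([cluster_tags[k] for k in labels[:10]])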