AQ 2007 10 31 Discussion


Data lineage tracking

Do we need to coordinate conventions for tracking provenance, even in readme files?
How do we recognize and track sources of variance (and their magnitude) within and across datasets throughout the processing chain?
- Take as an example the case where the x and y variances are not the same.
- Which processing tools compound error? How do we account for it?
How do we make apparent critical aspects of data interpretation as these aspects change along the processing chain? Which propagate?
- Do we need rules? How about conventions? A community organization to provide oversight and promote use, especially responsible use?
Air quality forecasting application
- plus special/exceptional events
Assessment of control programs application
- with attention to special/exceptional events
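
One way to make the variance questions above concrete is to carry a running variance alongside a list of processing steps. The sketch below is only an illustration, not a proposed convention: the `Step` and `Lineage` names, the step name, and the numeric variances are all hypothetical, and it assumes each step contributes an independent error so that variances simply add. Correlated errors would need covariance terms that this sketch omits.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One processing step and the independent variance it adds (hypothetical)."""
    name: str
    added_variance: float  # in squared measurement units


@dataclass
class Lineage:
    """Running provenance record for one derived quantity."""
    source: str
    variance: float
    steps: List[str] = field(default_factory=list)

    def apply(self, step: Step) -> "Lineage":
        # Assumption: errors are independent, so variances add
        # (standard deviations combine in quadrature).
        return Lineage(
            source=self.source,
            variance=self.variance + step.added_variance,
            steps=self.steps + [step.name],
        )

    @property
    def std_dev(self) -> float:
        return math.sqrt(self.variance)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
raw = Lineage(source="hypothetical NO2 column", variance=9.0)
gridded = raw.apply(Step("regrid", 16.0))
print(gridded.steps, gridded.variance, gridded.std_dev)  # ['regrid'] 25.0 5.0
```

Because each call to `apply` returns a new record, the original `raw` record is left unchanged, which keeps every stage of the chain available for comparison.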

Focus: HTAP and exceptional events

Identify users in these applications
Connect the existing projects (Airnow Tech, RSIG, Giovanni, etc.)
- Exceptional events: Airnow Tech is becoming a source for surface data
- HTAP: model evaluation is needed (e.g., RSIG, Giovanni)
How to support analysis and comparison of datasets
Create "interoperability network" and connect appropriate players
Lineage info required is driven by uses: How to capture this up front?

Data quality considerations

Can we list spurious sources of variance that must be taken into consideration as we visualize or composite datasets?
- Cloud cover
- Registration issues
  - sensitivity to neighboring measurements
- Instrument issues
  - resolution
    - types of interpixel contamination
  - interference (NOx, chemical, electrical)
  - "viewing" angle (each pixel, terrain issues)
- Timing issues
  - Time of day
  - Day of week
  - Lunar cycle
  - Special (car exhaust, sporting event, volcanic activity)
- Magnetic issues (no OMI for Rio)
  - Raised as a confounding issue only known to remote sensing elders
  - How much of an effect is seen at other sites (besides Rio)

What is the best way to determine "best available evidence"?
- How do we know if a remote sensing product has been verified with ground data and in situations comparable to use?
How and why might we tag a dataset "bad data"?
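
As a rough sketch of how the caveats listed above might travel with a dataset, the example below encodes the top-level categories as combinable flags. The `QualityFlag` and `describe` names are hypothetical, and the flag set simply mirrors the list in these notes; it is not an existing standard. The idea is that a dataset keeps the reasons for concern rather than a single "bad data" label.

```python
from enum import Flag, auto


class QualityFlag(Flag):
    """Hypothetical caveat flags, mirroring the categories listed in the notes."""
    NONE = 0
    CLOUD_COVER = auto()
    REGISTRATION = auto()
    INSTRUMENT = auto()
    TIMING = auto()
    MAGNETIC = auto()


def describe(flags: QualityFlag) -> list:
    """Return the names of the caveats set on a dataset, in declaration order."""
    # `f.value` skips the zero-valued NONE member on Python versions
    # that include it when iterating over the enum.
    return [f.name for f in QualityFlag if f.value and f in flags]


# Example: a scene affected by clouds and by a time-of-day effect.
scene_flags = QualityFlag.CLOUD_COVER | QualityFlag.TIMING
print(describe(scene_flags))  # ['CLOUD_COVER', 'TIMING']
```

A consumer could then decide per use case which flags disqualify a dataset, instead of inheriting one producer's judgment.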

Use Cases: Question-Driven Approach

Puget Sound Challenge

Highlight at EPA OEI Symposium Nov 14

Wildfire Scenario Southern California

How to advertise the needs and opportunities?

Beijing China Olympics

Door may be closed to participate ahead of time
High visibility for anything accomplished

How much NO2 is man made?

How would we provide data to justify the statement, "Man-made emissions of nitrogen oxides dominate total emissions"?

See http://www.apis.ac.uk/overview/pollutants/overview_NOx.htm

Thessaloniki dataset

Perhaps useful to discuss formally Ernest Hilsenrath's Thessaloniki slide

Compare appropriate use of "near real time" vs "standard product"

Provides example in which absolutes are not all there is
Would give experience in a limited context that might generalize
Has immediate effect on use and users

Is the air safe to breathe?

Can't give satisfying answer: spatial/temporal variation
Is it getting better?
How do we get people who are used to tracking environmental indicators and in situ monitoring to track and use the remote sensing data?
- Need to train and re-train experts and decision makers

Solutions

Technology problem

Statement was made that this may not be a technology problem
But we don't have technology for capturing inference or causes
- we are barely handling ontology when we need epistemology, grammar, rhetoric, and dialectics
We don't capture emergent properties
- Rudy's frames provide slots for lots of the necessary ancillary data
- Frame-based reasoning is brittle and doesn't handle ambiguity
Compare "readme" files and wiki
- wiki can be more interesting and see larger audience participation
- wiki may provide too many structural decisions; needs hourly grooming and attention
How do we capture the chain of decisions involved in data collection, processing, and use?
- The "chain" begins before the sensor is designed and extends through many intermediaries out to each user in new contexts.
  - How do we tag data appropriately for an educator or policy maker?
  - Decisions are made between perceived choices. At the extreme, what if there are whole new sets of considerations that hadn't arisen at the time the mission was designed? What if they arise for some irreversible processing decision?

Social problem

How do we get people to use the technology?
- Dedicate a person to capturing ephemera
- But there is an inestimable volume of such; it would be too expensive
Question may be, "How do we elevate those few really significant details?"
- What tools would facilitate this? Easy way to raise "level 3" objections or warnings
Different communities don't know enough to ask each other for the "right" things
- Need low level interaction to raise common ground
Sometimes more than a "chain" of intermediaries: a whole social network
- There is feedback even to the design of the next sensor
- Create a wire diagram of specific flows (chain diagrams)
- Examine tools from popular social networking sites (Orkut, del.icio.us, Amazon, Friendster, LinkedIn)

Systemic problem

Are there appropriate rewards for investment of time in metadata?
How do we control for bias, institutional and other?
- There are always reasons to "sell" aspects of what was done
- Everyone has a bias; courts require id for advocacy; should science?
- Science has an assumption of no unbiased judgments: is a non-rhetorical assumption valid?
  - data processing and presentation can be/is rhetorical, advocating a decision
Are the right people at the table?
- ESIP Federation was established to assure that the right mix of experts was involved in data decisions.
- Can we expedite reviews using "federalism" (balanced interests)?

Useful analogies

Process for creating "data spaces" used for data sheets in education
- DLESE and NSDL Data Access Working Group
- Data sheets are one-page summaries giving the "vibe of the thing"
Stock Market
- Used to be little information available; everyone used a broker
- Now day traders can get high level realtime data
- Idiots get weeded out
Radar images in weather report
- Having Google Earth OMI visualizations scares folks
- It is not like weather; seems like only stupid people live in Northeast urban areas
- 2004 SCIAMACHY image got front-page coverage showing European pollution sites; people knew red meant bad; raises whole issue of color scales and interpretation; AQI has regulatory requirement for color values
- Might better inform people who could drive change.
Ecosystem Model
- Data services are in an environment competing/cooperating
- Diversity is good to avoid collapse
- Valuable to have multiple overlapping products to compare
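
The point above about the AQI having required color values can be illustrated with a small lookup. This is a sketch under stated assumptions: the breakpoints and color names follow the US EPA AQI categories as commonly published, the `aqi_category` function name is hypothetical, and treating values above 500 as Hazardous is a simplification made here. Anyone relying on it should check current EPA guidance.

```python
# (upper bound of AQI range, category name, color name)
AQI_CATEGORIES = [
    (50, "Good", "green"),
    (100, "Moderate", "yellow"),
    (150, "Unhealthy for Sensitive Groups", "orange"),
    (200, "Unhealthy", "red"),
    (300, "Very Unhealthy", "purple"),
    (500, "Hazardous", "maroon"),
]


def aqi_category(aqi: int):
    """Return (category, color) for an integer AQI value."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper, name, color in AQI_CATEGORIES:
        if aqi <= upper:
            return name, color
    # Assumption made for this sketch: anything above 500 is reported as Hazardous.
    return "Hazardous", "maroon"


print(aqi_category(42))   # ('Good', 'green')
print(aqi_category(151))  # ('Unhealthy', 'red')
```

A fixed mapping like this is what distinguishes AQI maps from the free choice of color scale in a typical remote sensing visualization, where red may or may not mean "bad".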

Opening Issues

What is the particular niche for this group?
- Comparison to science meeting
  - Was this more like a science meeting?

Can we segment topics so all parties participate in all discussions?

Is there a way to capture all the discussion that happened around presentations?

Special opportunities to build "knowledge base"
- Journal articles may not capture all available discussion

Could we test tools that would allow us to elaborate and apportion the significance of topics? Maybe we could use a wiki for this.

The comment was made that our measurements are "unstable", which elicited the comment, "Then what are we doing here?" How do we avoid defaming all our data with details? Certainly these satellite differences are significant. How can we state that while questioning the details of our understanding?

Opportunity to reach across communities
- Clearly an advantage of the ESIP Federation is the opportunity to put technology people together with data people, scientists, educators, and applied science people. How can we take best advantage of this opportunity?

It might be that we could use discussion tools to help create an effective knowledge base that communicates appropriately to different audiences. Policy makers need to see that there is a human effect, not that there is still some question which of the effects are the most significant.

Data/tool Decision Tree

Wonder if we could set up a table comparing products

Might do by data source, visualization tool, processing tool, etc.

Usage considerations
- Legal
  - EPA always has to defend its judgments in court
  - Court needs "preponderance of evidence" or "beyond a reasonable doubt"
    - This is quite different from 95% certainty.
- Science
  - Needs detailed information about sources and models used in "correcting" data
- Education
  - Special considerations for "real-time" data
  - Special considerations for hiding and introducing complexity
How can we leverage the NO2 work as a pathfinder to accelerate our collective capacity to use data from new platforms such as the Orbiting Carbon Observatory (OCO) targeted for launch in 2008 or NPOESS?