Mangus: Description of EPA-OAQPS and HEI systems White Paper

From Federation of Earth Science Information Partners

Access to EPA’s Air Quality Data for Health Researchers

Questions on this draft white paper should be directed to Nick Mangus, EPA/OAQPS, [email protected], (919) 541-5549.

Introduction

A common refrain from policymakers, analysts, and scientists is that obtaining the air quality data they need is a challenge. This paper outlines the current collection and dissemination framework for air quality data and poses “charge questions” to the health-research/epidemiology community. The answers to these questions will help us at the EPA improve our offerings.

To frame the charge questions, this document describes a relatively new EPA system, the AQS Data Mart, and contrasts it with the HEI Air Quality Database, which was put in place to provide access to PM components and other data for health researchers. Finally, the charge questions are presented.

Background

The collection, storage, and dissemination of air quality data is a complex process achieved by a series of separate groups of hardware, software, and people. As technology has advanced and the number of distinct user groups (those with different data or analytical needs) has proliferated, the problem of any individual finding precisely what they need has only grown more complex. Adding to this complexity are intermediate “value-added” providers who may integrate, visualize, or otherwise post-process data from various sources. Thus, users can invest in their own data gathering and processing or they can rely on an array of intermediary providers. We also have data from special studies. The quality is (probably) high, but the data may not be readily available to others. So, EPA will always be the provider of certain base data, but we may not have it in the desired form, integrated with other desirable data (emissions or population), or presented in the desired manner. There will always be the possibility for a value-added provider to enhance the EPA data or integrate it with other data.

The following diagram is a simplified view of the components that accomplish the collection and dissemination tasks at the EPA. It will be used to explain how data are collected, stored, and provided by EPA and how the HEI acts as a value-added post-processor.

[Figure: AQS Flow Diagram.png]

The main part of the diagram shows the major components of the EPA’s Air Quality System (AQS). Beginning from the left-hand side, samples are collected in the field by monitors. Some of these samples are analyzed in situ; others are collected by the State, tribal, or local agency responsible for the monitor and analyzed at laboratories. Either way, the agency responsible for the monitor is also responsible for ensuring the measurements are reported to AQS. Note that only monitors within the EPA national ambient air quality monitoring network must have their data reported to AQS. For other monitoring networks or special studies (e.g., the Texas PM2.5 Sampling and Analysis Study), reporting is optional and the information may be stored in another system (e.g., NARSTO).

AQS is the EPA system designed to collect and store the monitored information. When users are allowed unlimited access to download information from such collection systems, the demands put on the system by voluminous requests can compromise the ability of the system to fulfill its collection function. To alleviate this problem, software engineers developed the AQS Data Mart which stores a copy of the information from the AQS and allows users to download data. It is a generic “retrieval” tool that provides the ability to query any information, but it does not provide significant data exploration or analysis capabilities. These capabilities are left to downstream “value-added” tools. EPA is in the process of transitioning our user applications designed for downloading information from the AQS database to the AQS Data Mart database. The right hand side of the diagram represents the several places to query or download air quality information that EPA provides. Each has been targeted to a specific audience: the general public, data analysts, or researchers. The diagram indicates which ones are still connected to AQS and the ones that have been transitioned to the AQS Data Mart. Note that the small cylinders by three of the systems still getting their data from AQS indicate that they must copy data and store it separately so as not to impose large loads on AQS. One of the advantages of using a data mart is to alleviate the need to store these data again.

As an example, raw PM2.5 data collected by EPA is available to external users in three of these EPA “front-ends”. Large text files can be downloaded from our website (The TTN Data Page at http://www.epa.gov/ttn/airs/airsaqs/detaildata/downloadaqsdata.htm). The AirExplorer site can be used to query, plot, and map these data. Finally, the Data Mart Direct Interface can be used to query the data. Each of these tools has advantages and disadvantages depending on the needs of the user. For more information about all of the front-ends listed in the diagram, please see Appendix A.

Beyond AQS and the related EPA systems, there are many other stakeholders involved in the collection and dissemination of air quality data, each with their own activities and possibly systems. AQS is likely the largest repository, but there may be additional information of interest to health researchers stored in other places. These additional stakeholders are represented by the other “layers” in the diagram. Elsewhere in EPA there are data collection and dissemination systems (CASTNET and AirNow in the Office of Air and Radiation; RSIG and PHASE in the Office of Research and Development; and Environmental Geoweb in the Office of Environmental Information). Additionally, EPA has other systems that present public and management views of air quality data.

The next layer out represents EPA partners, those who operate in cooperation with EPA, like the Health Effects Institute, Colorado State University, etc. who maintain data dissemination systems (many that integrate data from outside of AQS). Also in this layer are special studies (DEARS, NMMAPS, etc.) that manage the full lifecycle of air quality data management from collection to dissemination. Generally these non-governmental partners and EPA communicate with each other and the action that one takes may influence the other. Considering again the PM2.5 example, the HEI Air Quality Database uses the EPA provided data for PM and the nearest gas phase monitors, and integrates EPA emissions and non-EPA population and meteorological information. This is a value-added service to provide a custom-tailored solution to a specific community. Finally, there is the layer entitled “Others,” which represents those stakeholders who operate independently. These are the “unknown unknowns” in terms of additional data that may be collected or made available. Each of these groups brings with them a different list of what they can do easily, what they can do with difficulty, and what they cannot do. That is, each provides a degree of flexibility or constancy that makes them the best at providing a particular product or service. Collaboration, building on the strengths of each organization, is critical and one organization may have to take up the role of integrator and communicator so the research community knows where to get vital information. That is, if a clearinghouse listing all available databases, datasets, and access systems is needed, someone will have to manage its creation and maintenance.

The remainder of this paper discusses only one EPA access mechanism, the AQS Data Mart Direct Interface, which was designed specifically to address the needs of the research community. EPA perceived these needs as primarily the ability to locate and extract large sets of data. The Data Mart was made available for internal EPA use in mid-2006 and for external use, along with the Direct Interface, in early 2007. Use has been growing steadily since then, and the system has been well received by most of those who have accessed and used it. The Data Mart began as a pilot project, but the reaction from users has been positive enough that EPA management has committed to ongoing support for the system. Most of the negative reaction falls into two categories: the user friendliness of the system and the documentation of the data. To address the first, we continue to add features and improve usability to make the Data Mart as friendly as possible to the research community. Documentation of the data is not a problem inherent to the Data Mart, but we realize it is much needed, so we are also addressing this as we can. The remainder of this paper will introduce the Data Mart Direct Interface, compare it to the HEI Air Quality Database, and pose “charge” questions to the research user community to help us continue to improve these systems to meet your needs.

Contents of the Data Mart

The Data Mart contains every measured (“raw”) and aggregated (“daily and annual summary”) value reported to AQS from January 01, 1980 to the present. It also contains all of the same site and monitor descriptive data and measurement metadata in AQS. We have converted most data-entry codes to plain English words to help with the interpretation of downloaded data. There are no additional quality assurance steps performed on the data in the Data Mart, as the data in AQS are generally considered to be of the highest quality. Data must undergo many quality control steps as part of the loading process before it is saved in the AQS database. Likewise, submitters are required to assure that the monitor is operating properly and has passed precision and bias checks before loading the data. Finally, each year, EPA and the submitter review the data for completeness and correctness before the data are “certified” for regulatory use. It should be noted that IMPROVE (visibility network) and SANDWICH (modeled PM2.5 species) data are not generally reported to AQS. However, EPA staff has recently loaded the IMPROVE data for 1988-2005 into AQS and the loading of SANDWICH data is planned. As of January 14, 2008, there were 1.67 billion raw measurements for 885 different parameters in the database (there is a profiling spreadsheet under the documentation section of the web page). The Data Mart is refreshed from AQS each weekday night, so it always has the latest available information. However, since data up to 4 years old can be submitted to AQS at any time, and there are special windows for “historical” data updates, any of the contents can change at any time. That is, there is no freezing or snapshotting of data into a static version in the database.

Accessing the AQS Data Mart

The AQS Data Mart can be accessed by visiting the webpage, http://www.epa.gov/ttn/airs/aqsdatamart, and following the “Access” link. Registration is required, and a user ID and password are needed for access. You may sign up for your own account or use a guest account with user = [email protected] and password = AQSdatamart1 (case sensitive). Access is provided by an application that you can either run in your web browser or download and run on a PC. The application is used to submit a query. A query lets the user select the geography, substance (parameter), time, metric, and optional data to return. The Data Mart currently has five queries, summarized below.

Query | Description
Values | Recommended, returns any single raw, daily, or annual variable with metadata and is very efficient
Monitor | Returns descriptions of the monitoring site and equipment
Annual Summary | Returns all annual summary aggregate statistics for the monitors selected
Raw Data | Returns raw data in the AQS transaction format - recommended only for AQS users
Sites by Threshold | Returns a list of sites that meet a specific data-related threshold that you specify

When the query is complete, results can be downloaded using the application or by following a link in an email message sent to the user. All output is in XML format, but with embedded links to stylesheets for user-friendly display.
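Because output is XML, a researcher may want to flatten a downloaded result into CSV for use in a spreadsheet or statistics package. The following Python sketch shows one way this could be done. The element names (Results, Row, SiteId, Parameter, SampleDate, Value) and the sample values are hypothetical placeholders chosen for illustration; they are not the actual Data Mart schema, which should be checked against a real download.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names and values are illustrative
# only and do not reflect the real AQS Data Mart output schema.
SAMPLE_XML = """<Results>
  <Row>
    <SiteId>SITE-A</SiteId>
    <Parameter>PM2.5</Parameter>
    <SampleDate>2007-01-03</SampleDate>
    <Value>12.4</Value>
  </Row>
  <Row>
    <SiteId>SITE-A</SiteId>
    <Parameter>PM2.5</Parameter>
    <SampleDate>2007-01-06</SampleDate>
    <Value>9.8</Value>
  </Row>
</Results>"""


def xml_rows_to_csv(xml_text, row_tag="Row"):
    """Flatten each <row_tag> element into one CSV line.

    Child element names become column headers, in first-seen order.
    Returns the CSV as a string (empty string if no rows are found).
    """
    root = ET.fromstring(xml_text)
    rows = [
        {child.tag: (child.text or "").strip() for child in row}
        for row in root.iter(row_tag)
    ]
    if not rows:
        return ""

    # Build the header from every tag seen, preserving first-seen order.
    fieldnames = []
    for row in rows:
        for name in row:
            if name not in fieldnames:
                fieldnames.append(name)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing fields are written as empty strings
    return buffer.getvalue()


if __name__ == "__main__":
    print(xml_rows_to_csv(SAMPLE_XML))
```

The row tag is a parameter so the same function can be pointed at whatever repeating element a real result file uses.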

The Data Mart is intended as an extraction system only and EPA does not plan to provide analytic or graphical capabilities with the Data Mart. However, some of the other tools that EPA provides do have these capabilities (see Appendix A for details).

Contents of the HEI Air Quality Database

In September 2005, a group funded by the Health Effects Institute (HEI) and led by Christian Seigneur and Betty Pun at Atmospheric and Environmental Research (AER) launched a website/database to facilitate health effects studies that require detailed knowledge of air pollutant levels and other relevant information at selected sites across the US. The HEI Air Quality Database combines information on PM2.5 components collected at monitoring sites in the Chemical Speciation Network (CSN); meteorological variables; and levels of gaseous pollutants (SO2, O3, NOx, and CO) from monitoring sites at or near each CSN site. Metadata are provided for each monitoring site, such as its geographic coordinates; state, county, and city location information; population; and emissions data for nearby point, area, and mobile sources. AER updates information in the HEI Database every few months and is currently funded to do this through 2008.
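To illustrate what pairing a CSN site with gas monitors "at or near" it might involve, the following Python sketch picks the closest monitor by great-circle distance. This is only an assumption about how such a pairing could be done; the paper does not describe the actual HEI/AER matching rule, and the site names and coordinates used with it would be supplied by the user.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_monitor(site, monitors):
    """Return the monitor dict closest to the given site dict.

    Both site and monitors are dicts with 'lat' and 'lon' keys (a
    hypothetical record layout used only for this sketch).
    """
    return min(
        monitors,
        key=lambda m: haversine_km(site["lat"], site["lon"], m["lat"], m["lon"]),
    )
```

A real pairing would likely also apply a maximum distance and check that the candidate monitor reports the pollutant and time period of interest.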

Accessing the HEI Air Quality Database

The HEI Air Quality Database can be accessed by visiting the webpage, http://hei.aer.com. Once you obtain an account by following the instructions on this page, you can access the site browser and list building, database queries, and users’ guides. The general data retrieval process consists of four steps: browsing sites, defining and saving a list of sites, extracting data for the sites in a saved site list, and downloading the extracted air quality data.

Comparison of AQS Data Mart and the HEI Air Quality Database

The HEI Air Quality Database represents a value-added service over what EPA provides for a scientist looking for specific speciated PM2.5 data to evaluate in health research studies. So, a natural starting point for such a user would be the more tailored HEI system. If, however, that system does not have some particular information that the user needs, they can revert to using the EPA system. The EPA system is broader, but less refined; the closer the user gets to the source, the more raw material they must process to get a finished product. The following table compares some of the features of the HEI Air Quality Database and the AQS Data Mart to illustrate some of these trade-offs.

Feature | HEI Air Quality Database | AQS Data Mart
Site browser | Yes, with maps to help | No
Site finder | Yes, with multiple-variable filter | Yes, via a single-variable “sites by threshold” query
Query from saved list | Yes. Station lists may be saved and re-used | No. Query based on geography and parameter or single site
Query by any geography | Yes | Yes
Air quality data for PM2.5, O3, CO, NOx, NO2, & SO2 | Yes | Yes
Air quality data for all other parameters | No | Yes
AQS met data | Yes | Yes
Integrated non-EPA met data | Yes | No
Emissions data | Yes | No
Census data | Yes | No
On-line help | Yes | No
Off-line help | Yes | Yes
File format | CSV | XML (CSV planned)
Data returned in one file | No | Yes
Update frequency (versions) | Quarterly | Daily
Build your own query | Yes | No

To summarize the key differences:

  • The HEI interface is more tailored to the PM2.5 analyst.
  • The HEI interface contains emissions, census, and NCDC meteorology data; the Data Mart does not.
  • The Data Mart contains all ambient data reported to AQS (not just PM, meteorological, and NAAQS gases).
  • The Data Mart only contains special studies data (e.g., supersites) if it has been loaded into AQS.

Interpreting the Data

Between data element names, report headings, and data transfer formats, there are almost 2000 named data elements relating to air quality that EPA makes available. In addition, some of the values in those fields need individual documentation to properly describe them (for example, what is the difference between a SLAMS and a NAMS monitor type). To help the user identify and perhaps understand the data they have, the EPA created an annotated, cross-referenced index called the “Field Guide to Air Quality Data”. It is available in the documentation section of the Data Mart web page. There is also a list server that can be used to ask questions or monitored for system status.

Charge Questions - Introduction

To help prioritize and define future activities so that we can better meet the needs of the members of the research community, EPA has compiled a list of “charge questions” for invitees to this conference to consider. The overarching issue is connecting the data users to the data providers. For EPA and our partners to improve on this, we need to fully understand the data needs of the health research community. The more specifically the needs can be elucidated, the more concrete actions that can be taken to improve the situation. We are interested in feedback from users and potential users of air quality data and retrieval tools. This paper is concerned only with access to existing data; possible new data collection activities are covered elsewhere.

Standout Charge Questions

In previous interactions with data users and the health research community, three questions have repeatedly come to the forefront as seemingly ubiquitous and critical. These issues are also at a high level, and decisions on them will potentially affect decisions on the other charge questions. To complicate matters, there is no single unifying idea that all agree is progress in the right direction on these issues. Thus, these questions are presented in more detail and with possible solutions to initiate discussions.

  1. Data versioning/snapshotting: How often should EPA release data and how should we indicate that it has changed? The EPA, HEI, and others currently provide data via many applications. The data in those applications are generally updated on a schedule or as new data become available. For example, the AQS Data Mart is updated every day with new submissions and changes to AQS. However, new data or changes coming into AQS may be 10 years old. So a value in the AQS Data Mart representing a sample taken in the late 1990s may change today. Likewise, the HEI Air Quality Database is generally updated as the EPA makes new AQS “flat file” data extracts available on our web sites. This is usually done quarterly and without notice, thus the HEI database changes about quarterly; and the same 10 year rule applies. The key difference is that if you get data from the AQS Data Mart and your colleague gets the “same” data the next day, the data may have changed. If you are using the HEI database, the data may also have changed in one day, but the odds are less and the data vintage is clear in the “about” pages of the website. The stability of data for verifying and comparing research is essential, so the charge question is this: How often should EPA release data and how should we indicate that it has changed? One solution to this issue is to only make new data available outside EPA once per year. These data would be released on Independence Day and would be up-to-date through the prior year. This option provides greater stability to the data but may not be timely enough for particular studies or NAAQS revisions. A second solution is for EPA to continue to release data as it is received. Each value would be date-stamped with the date it last changed along with the date it represents. This allows for comparisons of data sets but requires more data to be downloaded and analyzed by the user. There are many intermediate options that could be implemented.
  2. Topic-focused portals: Are topic focused portals needed for air quality data? If so, what should those portals be and what should they contain? A strength of the HEI Air Quality Database is that it is geared towards health researchers evaluating speciated PM2.5 data and the user interface provides tools and information specifically targeted to this user. The AQS Data Mart, on the other hand, is generic and targeted at anyone wishing to download air quality data. An annotated map of the PM2.5 speciation sites on the HEI page helps the user understand and find the data they need. An analogous map of all 5,000 sites represented in the AQS Data Mart would only overwhelm and confuse users. Custom tailored “portals” into data, like HEI’s, are very helpful to the user, especially when they have an interest limited to less than everything available. The EPA is reasonably good at providing data but is often constrained in the technology we can use to provide descriptive and analytical tools. Likewise, we are sometimes not able to quickly secure funding to add tools to respond to developing areas of interest. This may be a place where the flexibility of external organizations can be used to provide a more custom, and therefore useful, experience. Are topic-focused portals needed for air quality data? If so, what should those portals be and what should they contain? For example, there could be portals specific to PM2.5 speciation, ozone and precursors, toxics, organic compounds, etc. Given the new technologies, a portal that resides outside of the EPA can have live access to a single, consistent, stable database within EPA.
  3. Accessibility of non-AQS data: The AQS Data Mart stores data from the national ambient air quality monitoring network(s) and, as previously mentioned, has recently begun to add some data from other networks and “special studies.” Is it important to have access to data from local, short-term, air quality special studies? Examples include MESA-Air, DEARS, Supersites, and ultrafine particle projects. If these data should be included, how should it be done? For example, to be loaded into the AQS Data Mart the data must match the monitor paradigm (no remote sensing or mobile monitors), must meet format and quality requirements, and must have associated descriptive data (e.g., method used, sampling schedule). Getting new data to match EPA’s data standards is often labor-intensive; is it worth it? Would EPA have to correct and load these data into the AQS database (or a “research” copy of the AQS database)? Would EPA be able to commit the resources to doing this? As an alternative, EPA can provide information to managers of new studies about the data format and content standards we have so that the data can be collected in a way that could be more easily shared and compared with AQS data or other new data collected using the standards. If these special study data remain outside of EPA systems, is there a role for a clearinghouse? The clearinghouse could keep an up-to-date list of monitoring efforts, databases, contents, and appropriate uses. Issues to be considered include: how resource intensive would this effort be and who would develop and maintain this clearinghouse?
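
To make the second option under the data versioning question more concrete, the following Python sketch shows how a user might compare two downloads if every value carried both the date it represents and the date it last changed. The record layout (a key of site, parameter, and sample date mapped to a value and a last-changed date) is a hypothetical illustration, not an existing Data Mart format.

```python
from datetime import date


def compare_snapshots(old, new):
    """Report differences between two date-stamped downloads.

    Each snapshot is a dict mapping (site, parameter, sample_date) to
    (value, last_changed). This layout is assumed for illustration only.
    Returns a list of (status, key, old_value, new_value) tuples, sorted
    by key, where status is "added", "changed", or "removed".
    """
    differences = []
    for key, (new_value, new_stamp) in new.items():
        if key not in old:
            differences.append(("added", key, None, new_value))
        else:
            old_value, old_stamp = old[key]
            # A later last-changed stamp means the value was resubmitted.
            if new_stamp > old_stamp:
                differences.append(("changed", key, old_value, new_value))
    for key, (old_value, _old_stamp) in old.items():
        if key not in new:
            differences.append(("removed", key, old_value, None))
    return sorted(differences, key=lambda item: item[1])
```

With stamps like these, two researchers could confirm whether they are working from the same vintage of data by comparing only the records whose last-changed dates differ, at the cost of downloading and storing the extra field.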

Other Charge Questions

The remaining charge questions are more straightforward than the standout charge questions. They relate to how individuals gather and use data rather than to community-wide concerns.

  1. What are the key data that you need? Is any of this currently collected but not available?
  2. Is there a particular way that you need data organized, grouped, or formatted?
  3. What data elements other than measurements do you need?
  4. What is the typical domain of the data you need (time, space, and parameter selections; for example, 3 years, several cities, and 4 parameters; or 1 year, national, 44 parameters)?
  5. Are there “profiling” reports – descriptions of which sites collect which data, how complete the data are, etc. – that you need?
  6. Would you rather query a database or have a large list of files that you can select from to download (like http://www.epa.gov/ttn/airs/airsaqs/detaildata/downloadaqsdata.htm but with more geographic resolution)?
  7. What would your ideal query builder/interface look like?
  8. Are there pieces of data that we provide or questions that we ask that confuse you?

APPENDIX A – OTHER DATA ACCESS MECHANISMS

EPA has many places to access air quality data. Each of these websites or applications was designed for a specific target audience, for example, the general public concerned with acute health issues, the general public concerned with long-term air quality where they live, the general public interested in air quality comparisons between multiple locations (for living, vacationing, etc.), data analysts concerned with regulatory compliance, data analysts contributing to policy decisions, and health researchers. We consider a researcher to be someone who is looking to download raw data, either in large volume or in small, discrete sets that are difficult to tease out of large published datasets. Each of these websites or applications presents a unique front end for queries, charts, or maps geared toward its target audience.

EPA is developing a “portal” to list all of the sources of air quality (and emissions) data that are available and link directly to their access pages. This portal is at the following web address: http://www.epa.gov/oar/airpolldata.html. Below is a table comparing key information about each of the available EPA-maintained systems for air quality data (including AirNow and CASTNET which contain data not in AQS). The systems are described at the link above. (Key: a filled circle means “yes” and an empty circle means “some”.)

[Table image: EPA Systems Comp.png]

Mangus: Description of EPA-OAQPS and HEI systems White PaperCommunity Air Quality Data System Workspace Data Summit Workspace2008-04-09
