ESB Data Quality Standards

From Earth Science Information Partners (ESIP)
Revision as of 10:55, January 6, 2013 by Jldavis

The following is a paper by Jennifer Davis generated as a part of http://slisapps.sjsu.edu/gss/ajax/showSheet.php?id=4943 (LIBR284: Seminar in Archives and Records Management, Electronic Records at the School of Library and Information Science at San Jose State University).

Geospatial Data Quality Metadata Standards

10 August, 2012

Introduction

The modern world is being inundated with data: 15 terabytes per day streaming from NASA’s Earth observing satellites1; data generated from an average of roughly 278,000 tweets per minute (Farber, 2012); data from delicate instruments that can analyze 20,000 cells per second (BD Biosciences, 2003). We capture more data every two days than was recorded in all of human history prior to 2009 (The Korn/Ferry Institute, 2011).

Data, like statistics, can be deceptive. A number pair indicating a latitude and longitude seems to be a precise measurement, but it is in fact two numbers subject to the vagaries of the measurement device’s accuracy and stability, the expertise of the programmer and tester who wrote the software to capture and save the numbers, the reliability of the hardware used to store the numbers, and dozens of other attributes. Measurements of this constellation of characteristics are called data quality (DQ), and data quality is crucial information for the people who use data. “The costs of making incorrect science inferences based on faulty data can be substantial and far-reaching: errors can be subtle, inappropriate conclusions can go unchallenged for years in the literature, and follow-on research may be critically jeopardized” (Radziwill, 2006, p. 7). Roy et al. comment: “The correct interpretation of scientific information from global, long-term series of remote sensing products requires the ability to discriminate between product artifacts [errors] and changes in the Earth processes being monitored” (2002, p. 62). In fact, Roy et al. assert that “In many cases, MODLAND [land-oriented geoscience] products can only be used meaningfully after consideration of this [data quality] information” (2002, p. 62).

There are debates about what metrics constitute data quality; there is a movement to engage in “total data quality management” (TDQM) (Madnick, Wang, Lee & Zhu, 2009, p. 3) to try to improve data quality. And once data quality is measured, the data quality metrics (DQM) need to be stored such that they are easily understood by users, easily harvested and transferred between automatic data storage systems, and kept with the data that they describe, which means those metrics need to be put into a standard metadata format.

As with any other standard, coming to a consensus about this format is difficult. In this paper, business data quality metadata will be considered briefly, but to narrow the field, the focus will be on geospatial2 data quality metadata standards. ISO standards will be discussed; they are international, as are geoscience data sets, and they are the de facto standards for the scientific community. The ISO standards for geospatial data quality metadata grew, in part, out of existing American standards; their history and progression will be discussed. Current efforts to customize ISO standards for use by NASA and other Earth observing agencies will be examined. Finally, the effectiveness of these data quality metrics will be analyzed. It will be shown that geospatial data quality metadata standards are still being developed, and that more research is needed into the effectiveness of data quality metrics.

Analysis

Metrics

Before discussing how to record data quality measurements in standardized metadata, the measurements themselves should be understood. Definitions of data quality vary, but most include roughly these elements: “accuracy, correctness, currency, completeness and relevance” (Wikipedia, 2012, July, “Overview”). Narrowed to the factors involved in geoscience data, the metrics could be defined as “lineage, geometric precision or positional precision, semantic precision or precision of attributes, completeness, and logical consistency” (Servigne, Lesage, & Libourel, 2006, p. 4). Xia has a more thorough list of useful metrics: accuracy, accessibility, authority, compatibility, completeness, consistency, currency, discoverability, integrity, legibility, repurposing, transparency, validity, verifiability, and visualization (2012). Servigne goes on to add that even the method of data quality measurement should be recorded: “the date of processing (temporal aspect), the evaluation method used (tested, calculated or estimated) and the population on which it was applied” (2006, p. 7). A much more detailed, non-science-specific list of potential metrics is provided by Wang and Strong (1996, p. 11), and a long list of metric categories can be found in Radziwill (2006, p. 12). There are plenty of data quality aspects from which to select appropriate metrics.
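Servigne's recommendation, that each quality measurement carry its own provenance (date of processing, evaluation method, and population), can be sketched as a simple record structure. The field names below are illustrative inventions for this paper, not drawn from any metadata standard:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class QualityMeasurement:
    """One data quality metric plus the provenance Servigne recommends recording."""
    metric: str        # e.g. "positional accuracy", "completeness"
    value: float       # the measured result
    unit: str          # e.g. "meters", "percent"
    method: str        # "tested", "calculated", or "estimated"
    measured_on: date  # the temporal aspect of the evaluation
    population: str    # what the evaluation was applied to

# Example: a hypothetical completeness measurement for one dataset
m = QualityMeasurement(
    metric="completeness",
    value=98.5,
    unit="percent",
    method="calculated",
    measured_on=date(2012, 8, 10),
    population="all pixels in one data granule",
)
print(m.metric, m.value, m.unit, m.method)
```

Keeping the method and population alongside the value means the measurement remains interpretable even after the dataset is transferred between systems, which is exactly the argument for storing these records as metadata rather than in separate reports.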

Once the quality is measured, however, those metrics must be recorded and kept with the dataset in order to be useful. This concept seems to be missed by much of the “data cleansing” discussion and by the automated software available to businesses’ data warehouses. For example, Contact Zone, a cleansing application, boasts the ability to let customers “access reports for statistics on [their] data” (Melissa DATA Corp., n.d.) (as opposed to those reports being metadata attached to the dataset). If the data quality results are not stored with the dataset, as metadata, they could easily be separated and lost.

The data quality metrics should not be stored in any random format, however. Following a metadata standard ensures better understandability and interoperability with other datasets and other systems. For example, MODLAND describes, in a 2002 paper, how their data quality metadata consist of four possible two-bit codes kept with each data pixel and two quality flags that apply to an entire dataset, along with explanations for those flags. “The Science Quality metadata carry the most detailed quality information so that users can make informed choices when ordering products. The Science Quality Flag is set to one of seven valid states…and the associated Science Quality Flag Explanation describes the setting and provides supporting information” (Roy et al., 2002, p. 65). ISO 19115 (described below) was not available until 2003, and it is good that MODLAND went ahead and defined data quality metrics and created an internal metadata structure for them. However, because this structure is unique to MODLAND, it makes interoperability and automated harvesting more difficult for these datasets.
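The per-pixel scheme Roy et al. describe packs several two-bit quality codes into a small integer stored alongside each pixel. A minimal sketch of how such codes might be unpacked follows; the bit layout and the meanings assigned to each code are hypothetical illustrations, not MODLAND's actual definitions, since the real layouts are product-specific:

```python
def decode_qa_field(qa_value: int, field_index: int, bits_per_field: int = 2) -> int:
    """Extract one two-bit quality code from a packed per-pixel QA value."""
    shift = field_index * bits_per_field
    mask = (1 << bits_per_field) - 1  # 0b11 for two-bit fields
    return (qa_value >> shift) & mask

# Hypothetical meanings for a two-bit code (illustrative only)
MEANINGS = {0: "good", 1: "marginal", 2: "poor", 3: "not produced"}

# A packed QA byte holding four two-bit fields: 0b11_10_01_00
qa = 0b11100100
for i in range(4):
    code = decode_qa_field(qa, i)
    print(f"field {i}: {code} ({MEANINGS[code]})")
```

The appeal of such bit-packing is compactness: quality travels with every pixel at a cost of one byte. Its drawback is exactly the one noted above: without a shared standard documenting the layout, each consumer must learn a product-specific decoding.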

DIF

In 1987, the Directory Interchange Format (DIF) was created to fill the role that metadata would come to play in geoscience datasets. It was “a descriptive and standardized format for exchanging information about scientific data” (Olsen, 2012). The DIF was created for NASA’s Master Directory (NMD). “The DIF does not compete with other metadata standards. It is simply the ‘container’ for the metadata elements that are maintained in the IDN database, where validation for mandatory fields, keywords, personnel, etc. takes place” (Olsen, 2012). The DIF led the way with a specific quality element, one out of 36 elements. To keep the format current, an XML schema definition for the DIF is available. The quality element definition and its XML equivalent are shown in Figure 1. (TBS)

DIF is still in operation at the Global Change Master Directory (GCMD), the next incarnation of the NMD, which is the clearinghouse for the Federal Geographic Data Committee (FGDC) (Olsen, 2012), a group created in 1990 by Office of Management and Budget (OMB) fiat (Federal Geographic Data Committee, 2006, January).

CSDGM

The original FGDC metadata standard was the Content Standard for Digital Geospatial Metadata (CSDGM), finalized in 1998 (NASA, 2004); its data quality section was based on the 1992 Department of Commerce Spatial Data Transfer Standard (Federal Geographic Data Committee, 2006, February). CSDGM is important: “CSDGM is required of all federal agencies and of projects/programs funded through US federal funds, and so it is quite widely used” (Swatson, 2008). The CSDGM definition is written in an obscure notational style that does not provide the same interoperability as an XML schema; an example is in Figure 2. (TBS) This notational style, and the standard itself, were complicated enough that a graphic interpretation was created to try to make the standard easier to use (Tucker, n.d.); the positional data quality metrics portion of the graphic is shown in Figure 3. Accessibility issues aside, CSDGM has a much more detailed data quality section, and in that respect it is a significant improvement on DIF.

The CSDGM does not, however, provide a format for the metadata. “The standard does not specify the means by which this information is organized in a computer system or in a data transfer, nor the means by which this information is transmitted, communicated, or presented to the user” (NASA, 2004). CSDGM users can provide extensions to the second version to fit their needs: “CSDGM Version 2 allows geospatial data communities to develop ‘profiles’ of the base standard. Many of these profiles have extended the base standard by adding metadata elements to meet their specific community metadata requirements. In particular, a proposed set of Extensions for Remote Sensing Metadata recently underwent public review” (NASA, 2004). Customizations might create problems with automated harvesting; in the case of the CSDGM, which defines no format to begin with, this is probably less of an issue than a customization to an ISO standard would be.

ISO Standards

There are multiple ISO standards that relate to data quality. ISO 8000, “Data quality” (Wikipedia, 2012, May), appears to be used more by the business community and is still under development. It is an “ISO Technical Specification that describes the fundamentals of data quality, defines related terms, and specifies requirements on both data and organizations to enable data quality” (Benson, 2007). ISO 19113, 19114, 19115, and 19138 are all related to geospatial data quality (Servigne, 2006, p. 24). ISO 19115 was previously ISO 15046-15 (Peng & Tsou, 2003, p. 286) and was developed based on the FGDC CSDGM (“Introduction”, n.d.). ISO 19130 extends the definitions in 19115 by defining “the metadata to be distributed with the image to enable user determination of geographic position from the observations” (ISO, n.d.). There is even an XML schema definition for ISO 19115 in ISO 19139 (ESRI, 2011), available online (Norwegian Mapping Authority, 2012); an example element in this XML format is shown in Figure 4. (TBS) Having an XML schema definition for ISO 19115 makes it more likely to be used.
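To illustrate why an XML schema definition aids automated harvesting, the sketch below parses a simplified fragment written in the style of ISO 19139's gmd namespace and extracts a quantitative quality result using only Python's standard library. The fragment is hand-written for illustration and is not a schema-valid ISO 19139 document; the namespace URIs follow the isotc211.org convention cited above:

```python
import xml.etree.ElementTree as ET

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"

# Simplified, hand-written fragment in the style of ISO 19139 (not schema-valid)
sample = f"""
<gmd:DQ_DataQuality xmlns:gmd="{GMD}" xmlns:gco="{GCO}">
  <gmd:report>
    <gmd:DQ_CompletenessOmission>
      <gmd:result>
        <gmd:DQ_QuantitativeResult>
          <gmd:value><gco:Record>98.5</gco:Record></gmd:value>
        </gmd:DQ_QuantitativeResult>
      </gmd:result>
    </gmd:DQ_CompletenessOmission>
  </gmd:report>
</gmd:DQ_DataQuality>
"""

root = ET.fromstring(sample)
ns = {"gmd": GMD, "gco": GCO}
# Because element names and nesting are fixed by the schema, a harvester can
# locate the quality result without any product-specific knowledge.
record = root.find(".//gmd:DQ_QuantitativeResult/gmd:value/gco:Record", ns)
print(record.text)
```

This is the practical payoff of a schema: the same few lines of harvesting code work across any dataset whose metadata conform to it, which is precisely what ad hoc structures like MODLAND's internal format cannot offer.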

ISO 19113, 19114 and 19138 are being combined into, and replaced by, ISO 19157 (European Commission Joint Research Centre, 2012). “ISO 19157 extends the DQ_DataQuality section of 19115 in several ways…The 19115 DQ_Element includes three kinds of information which can cause some confusion and can make appropriate repetition difficult. These problems are addressed in 19157 by splitting the DE_Element [sic] into three objects” (NOAA, 2011). Combining three of these standards into one may make it a little easier to figure out which one is needed to annotate a particular dataset; however, one is still left with at least three relevant standards to consider: ISO 19115, 19157, and 19139. The relationships between all of these ISO standards are illustrated in Figure 5. (TBS)

While it is clear that ISO is attempting to codify ways to report on geospatial data quality in metadata, the thicket of related ISO standards, and the requirement to pay even to evaluate them, make it difficult to know which standard to use.

Extensions to ISO Standards

In 2007, FGDC combined its existing 1998 CSDGM with ISO 19115, in cooperation with Canada and its pre-existing standard, and produced the “North American Profile of ISO 19115:2003”, which is freely available for review (Brodeur et al., 2007). Section 5.5 describes, in reassuring detail, the data quality portion of this metadata standard (Brodeur et al., 2007, p. 48), down to the level of defining elements such as “DQ_AbsoluteExternalPositionalAccuracy” (Brodeur et al., 2007, p. 49), which is the “Description of the methods, procedures, conformance results or quantitative results, and date stamp of the positional measurement in the dataset” (Brodeur et al., 2007, p. 66). An XML schema is not provided in this document, but the appendices contain Unified Modeling Language (UML) diagrams of the standard; UML is a standard notation, unlike the notation used in the original description of the CSDGM. Figure 6 shows the data quality portion. (TBS)

Other agencies are finding the need to extend ISO 19115. Commenting on this, a scientist from NASA’s Earth Observing System, S. Berrick, states: “Right now our approach is to use ISO 19115…and augment that with a NASA ‘convention’, and then try to get other agencies to buy into it for Earth science. We [would] rather it not be a NASA convention, but a multi-agency or even multinational convention” (personal communication, 2012, August 5).

Effectiveness

How are the consumers of datasets using data quality metadata? Not enough research has been done. Fisher, Chengalur-Smith, and Ballou (2003) ran a study that evaluated the use of one data quality metric and found that managers were better able to use the metric successfully than novices. The study also indicated that the managers used the data quality metric inconsistently, even though only one attribute was provided and it was held consistent across the tests. Price and Shanks (2011) echo that finding: “Individual subjects may say that the meaning of the tags is clear because they each have their own internal—even if erroneous—interpretation. This could lead to random error in when or how DQ tags [metadata] are used that impacts experimental reliability (i.e., repeatability) or validity” (p. 325), not to mention random error in decision making in a non-experimental setting.

These studies imply that when faced with an array of multiple quality factors, a user with little training and experience in interpreting them may not get much value from them unless they are glaringly unsubtle (such as, perhaps, an element declaring all data in a dataset missing or corrupt). It is clear that in order to make sure the metadata standards being developed are useful, more studies are needed on how data consumers use data quality metadata, if at all. In fact, Price and Shanks state that their study “does not offer even limited support for the possible utility of DQ tags. The only evidence of DQ tag impact on decision outcomes—reduced efficiency and consensus—is clearly detrimental to decision making, thus contraindicating the general adoption of DQ tagging” (2011, p. 341). These two studies are geared more towards the business community; it could be hypothesized that science data users, being perhaps more attuned to the issues of data quality, would understand data quality metadata better, but that is a dangerous assumption. Also, a scientific remote sensing dataset created by an instrument on a satellite cannot be revised in the same way that an incorrect zip code in an address database can; business might have the luxury of foregoing data quality metadata in favor of simply improving data quality, whereas a scientific dataset is a more static entity and measures of its quality a more necessary addition. Therefore it is critical that geoscience data users understand how to read and use data quality metrics effectively.

Shankaranarayanan and Zhu experimented with presenting data quality metrics in a visual interface (2012). They found, similarly to Price and Shanks (2011), that “when task complexity is high, QM [quality metrics] can negatively impact decision outcome due to cognitive load. The user gives up on cognitive effort and sacrifices decision accuracy” (Shankaranarayanan & Zhu, 2012, p. 1442). However, they found that a visual interface improved the usefulness of the metrics: “The visual representation of QM can help communicate the meaning of QM to users; reduce the mental demand to integrate QM into decision making, and consequently improve decision performance” (p. 1442). This implies that while storing data quality metrics should happen in a standardized metadata format, a tool to interpret those metrics might be useful.

Summary

Data quality metadata is not a particularly new field, but the lack of research and user awareness makes it feel like one. Business users do not seem to be aware of a need to record this metadata and store it alongside a dataset, and may be content to have a separate report generated by commercial software running a standardized analysis. Earth science data providers are aware of the need but are grappling with ways to accommodate varied data quality requirements and the differing traditional models of individual agencies in order to create usable, exchangeable metadata standards. ISO standards provide a useful starting point but do not seem specific enough to satisfy the detailed needs of real-world datasets. There is little to no research on the usefulness of the geospatial data quality metadata that exists in the field; that research will be crucial to determining whether the standards are making a difference in data usability and in proper interpretation and analysis of datasets.

It is clear that the quality of scientific data is important. Metadata are the best way to record quality information and keep it with the dataset, and standards are the best way to encode metadata. While little is known about how useful geospatial data quality metadata are to data consumers, such metadata are a necessary component of creating good information from scientific datasets, and the solution may be to include user awareness and user training in the efforts to improve data quality, alongside efforts to standardize data quality metadata.

References

(Note that SJSU SLIS uses APA Style http://www.apastyle.org for its references.)

BD Biosciences. (2003, June). BD LSR II flow cytometer [Brochure]. Retrieved from http://facs.bio.indiana.edu/pdf/LSRIIBrochure.pdf

Benson, P. (2007, July 11). Integrating standards in practice [PowerPoint slides]. Retrieved from http://www.google.com/url?sa=t&rct=j&q=iso%208000%20integrating%20standards%20in%20practice&source=web&cd=1&ved=0CEQQFjAA&url=http%3A%2F%2Ftc3.iec.ch%2Fmeetings%2Ftc3%2F2008%2Fprague_co-ordination%2F3cm_prag03.ppt&ei=MWYjUNvFMOeBiwLjvoGYCA&usg=AFQjCNGfSwAa2bni6yu6x7QcFV2YfvXwsw

Brodeur, J., Habbane, M., Sussman, R., Rushforth, P., Shin, S., Danko, D. M.,…Westcott, B. (2007, July 26). North American profile of ISO19115:2003--geographic information--metadata (NAP--metadata, version 1.1). Retrieved from http://www.fgdc.gov/standards/projects/incits-l1-standards-projects/NAP-Metadata/napMetadataProfileV11_7-26-07.pdf/view

ESRI. (2011, November 14). Metadata styles and standards. Retrieved from http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#//003t00000008000000

European Commission Joint Research Centre. (2012, April 24). ISO 19157 and ETS v5.2 2012. Retrieved from http://marswiki.jrc.ec.europa.eu/wikicap/index.php/ISO_19157_and_ETS_v5.2_2012

Farber, D. (2012, June 6). Twitter hits 400 million tweets per day, mostly mobile. [Web log post.] Retrieved from http://news.cnet.com/8301-1023_3-57448388-93/twitter-hits-400-million-tweets-per-day-mostly-mobile/

Federal Geographic Data Committee. (2006, January 19). The Federal Geographic Data Committee: Historical reflections--future decisions. Retrieved from http://www.fgdc.gov/library/whitepapers-reports/white-papers/fgdc-history/?searchterm=historical%20reflections

Federal Geographic Data Committee. (2006, October 4). Data quality information. Retrieved from http://www.fgdc.gov/metadata/csdgm/02.html

Fisher, C. W., Chengalur-Smith, I., & Ballou, D. P. (2003). The impact of experience and time on the use of data quality information in decision making. Information Systems Research, 14(2), 170-188.

The HDF Group. (2011, May 16). HDF-EOS project. Retrieved from http://www.hdfgroup.org/hdfeos.html

Introduction. (n.d.). Retrieved from http://webcache.googleusercontent.com/search?q=cache:M-r2bUECcycJ:geoconnections.org/architecture/technical/specifications/geodata_registry/introduction.html+iso+15046+csdgm&cd=4&hl=en&ct=clnk&gl=us&client=safari

ISO. (n.d.). ISO/TS 19130:2010. Retrieved from http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=51789

Jenner, L. (2012, May 4). Aqua: Part of the A-Train satellites. Retrieved from http://www.nasa.gov/mission_pages/aqua/

The Korn/Ferry Institute. (2011). The big data effect. Briefings on Talent & Leadership (Fall 2011). Retrieved from http://www.kornferryinstitute.com/briefings-magazine/big-data-effect-0

Madnick, S. E., Wang, R. Y., Lee, Y. W., & Zhu, H. (2009, June). Overview and framework for data and information quality research. ACM Journal of Data and Information Quality, 1(1), 1-22.

Maurer, J. (2011, November). Overview of NASA’s Terra satellite. Retrieved from http://www2.hawaii.edu/~jmaurer/terra/

Melissa DATA Corp. (n.d.). Contact Zone. Retrieved from http://www.melissadata.com/dqt/contact-zone/index.htm

NASA. (n.d.). About us. Retrieved from http://gcmd.nasa.gov/Aboutus/xml/dif/dif.xsd

NASA. (2004). FGDC CSDGM. Retrieved from http://earthdata.nasa.gov/our-community/esdswg/standards-process-spg/heritage-standards/fgdc-csdgm

NASA. (2012). Quality. Retrieved from http://gcmd.nasa.gov/User/difguide/quality.html

NOAA. (2011, July 11). ISO data quality. Retrieved from https://geo-ide.noaa.gov/wiki/index.php?title=ISO_Data_Quality#ISO_19157

Norwegian Mapping Authority. (2012, August 7). Data quality XML schema. Retrieved from http://www.isotc211.org/2005/gmd/dataQuality.xsd

Olsen, L. (2012). A short history of the Directory Interchange Format (DIF). Retrieved from http://gcmd.nasa.gov/User/difguide/whatisadif.html

Peng, Z.-R., & Tsou, M.-H. (2003, March 31). Internet GIS: Distributed geographic information services for the internet and wireless networks. Retrieved from http://books.google.com/books?id=sk5UHK-FJM8C&pg=PA286&lpg=PA286&dq=iso+15046-15&source=bl&ots=FvQx9qPSa9&sig=mpAbhy-dctov_frGIVZvi5_VDY0&hl=en&sa=X&ei=fDgeUP-GOKzWiAKz64DADA&ved=0CFUQ6AEwCA#v=onepage&q=iso%2015046-15&f=false

Price, R., & Shanks, G. (2011). The impact of data quality tags on decision-making outcomes and processes. Journal of the Association for Information Systems, 12(4), 323-346.

Radziwill, N. M. (2006). Foundations for quality management of scientific data products. The Quality Management Journal, 13(2), 7-21.

Roy, D. P., Borak, J. S., Devadiga, S., Wolfe, R. E., Zheng, M., & Descloitres, J. (2002). The MODIS Land product quality assessment approach. Remote Sensing of Environment, 83, 62-76. Retrieved from http://globalmonitoring.sdstate.edu/faculty/roy/QA_paper.pdf

Servigne, S., Lesage, N., & Libourel, T. (2006). Quality components and metadata. In R. Devillers & R. Jeansoulin (Eds.), Fundamentals of spatial data quality (pp. 179-208, 1-34 online). Retrieved from http://liris.cnrs.fr/%7Esservign/12_Servigne-EN-vSS.pdf

Swatson. (2008, February 21). FGDC Content Standard for Digital Geospatial Metadata (CSDGM). Retrieved from https://marinemetadata.org/references/csdgm

Tucker, R. (n.d.). Graphical summary of Federal Geographic Data Committee’s Content Standard of Digital Geospatial Metadata v. 2.0. Retrieved from http://www.mpcer.nau.edu/metadata/CSDGM.htm

Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5-33.

Wikipedia. (2012, May 29). ISO 8000. Retrieved from http://en.wikipedia.org/wiki/ISO_8000

Wikipedia. (2012, July 24). Data quality. Retrieved from http://en.wikipedia.org/wiki/Data_quality

Wikipedia. (2012, August 3). Aura (satellite). Retrieved from http://en.wikipedia.org/wiki/Aura_(satellite)

Xia, J. (2012). Metrics to measure open geospatial data quality. Issues in Science and Technology Librarianship. doi:10.5062/F4B85627


Footnotes

1One terabyte per instrument on Earth Observing System satellites (The HDF Group, 2011): five instruments on Terra (Maurer, 2011), six instruments on Aqua (Jenner, 2012), four instruments on Aura (Wikipedia, 2012, August).

2The terms “geospatial” and “geoscience” are subtly different: “geospatial data” can refer to any location data, such as that captured on photographs taken by smart phones; “geoscience data” refers to scientific measurements of the Earth, such as the remote sensing data from Earth observing satellites. In this paper, “geospatial data” will refer to the location information associated with “geoscience data”.