ISO Data Quality

From Earth Science Information Partners (ESIP)
Revision as of 14:47, January 6, 2017 by Ted.Habermann

A principal goal of metadata is to ensure that the data they describe can be independently understood and used effectively. Data quality tests and reports play a critical role in achieving this goal, so connecting them to the metadata record is important.

The approach to including quality information in ISO 19115 metadata records differs from previous approaches and is much improved. It can include descriptions of unexpected behaviors and anomalies as well as the tests used to identify them. Understanding how to take advantage of this flexibility, and implementing systems that maximize the value of this capability, will be a challenge for the environmental data community.

The Data Quality section of the ISO standard supports flexibility at several levels. The DQ_DataQuality object (see Figure) includes three sections: scope, lineage, and element. A metadata record can have any number of associated DQ_DataQuality objects.

Structure

Quality Metadata At A Glance
Data Quality Objects

The ISO Data Quality Metadata Standard (ISO 19157) fits on a single page (see Quality Metadata At A Glance) but it includes many different objects. The important conceptual objects are abstract and occur near the center of the Figure (DQ_Element, DQ_EvaluationMethod, and DQ_Result). The Data Quality Objects Figure shows a simplified view of the standard with more detail for some objects. Each of these abstract objects can be implemented in different ways depending on the specific approach used for evaluating data quality. These are described below.

DQ_Scope

Many datasets cover broad geographic regions and long time periods and are made up of many related pieces. The quality of the data and the methods for determining it can vary over all of these dimensions. Therefore, the metadata for describing data quality must be flexible enough to describe quality results for many subsets of a dataset. Every ISO record can reference any number of DQ_DataQuality objects, each of which gives quality information for a specific subset.

Each DQ_DataQuality object must have an associated MD_Scope that describes the subset. The MD_Scope includes a scope code drawn from a codelist of twenty-six possible values. The code is accompanied by a Level Description that provides additional information about the scope of the quality report. For example, if the scope code indicates that the report covers attributes, the level description states specifically which attributes are covered. Finally, the Scope object includes a spatial and temporal extent, which turns out to be very powerful: it allows data stewards and users to describe quality reports at almost any granularity within a dataset. The reports can vary with time and with space within a dataset; they can be associated with a certain type of feature or attribute, or with specific features; and they can be associated with specific collection sessions, hardware, or software.
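The way a scope narrows a quality report to a subset of a dataset can be illustrated with a small sketch. This is not an implementation of MD_Scope: the class name Scope, its field names, and the covers method are hypothetical, and the extent is simplified to a bounding box and a date range.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class Scope:
    """Hypothetical, simplified stand-in for the scope of one quality report."""
    level: str                                   # scope code value, e.g. "dataset" or "attribute"
    level_description: str = ""                  # which attributes, features, sessions, ...
    bbox: Optional[Tuple[float, float, float, float]] = None  # (west, south, east, north)
    period: Optional[Tuple[date, date]] = None   # (start, end), inclusive

    def covers(self, lon: float, lat: float, day: date) -> bool:
        """True if the point and day fall inside the extents; a missing extent
        means "no restriction". Boxes crossing the antimeridian are not handled."""
        if self.bbox is not None:
            west, south, east, north = self.bbox
            if not (west <= lon <= east and south <= lat <= north):
                return False
        if self.period is not None:
            start, end = self.period
            if not (start <= day <= end):
                return False
        return True


# Illustrative values only: a report about one attribute, limited to the
# southern hemisphere during 2010.
southern_2010 = Scope(
    level="attribute",
    level_description="sea_surface_temperature",
    bbox=(-180.0, -90.0, 180.0, 0.0),
    period=(date(2010, 1, 1), date(2010, 12, 31)),
)
print(southern_2010.covers(10.0, -45.0, date(2010, 6, 1)))  # True
print(southern_2010.covers(10.0, 45.0, date(2010, 6, 1)))   # False
```

A record with several DQ_DataQuality objects would hold several such scopes, and a user could select the reports whose scope covers the observation of interest.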

The same Scope Code occurs in the ISO Metadata and MaintenanceInformation objects.

Data Quality Reports

The central element of data quality is the report. It includes four major sections: the date of the test, the measure used to assess quality, the method used to apply the measure, and a description of the result of the evaluation. There are a number of report types that reflect general areas of data quality, e.g. DQ_Completeness, DQ_LogicalConsistency, DQ_PositionalAccuracy, DQ_ThematicAccuracy, DQ_TemporalAccuracy, DQ_QuantitativeAttributeAccuracy, and DQ_Usability. These are very general descriptors that should be used where clearly appropriate. At the same time, some of these names were created with collections of geographic features in mind and so may not make sense in the context of other data types. In those cases, the very general type DQ_QuantitativeAttributeAccuracy serves as a catch-all.

Data Quality Measures

DQ_MeasureReference [0..1]
+ measureIdentification: MD_Identifier [0..1]
+ nameOfMeasure: CharacterString [0..*]
+ measureDescription: CharacterString [0..1]

Data Quality Evaluation Methods

DQ_EvaluationMethod [0..1]
+ dateTime: DateTime [0..*]
+ evaluationMethodDescription: CharacterString [0..1]
+ evaluationProcedure: CI_Citation [0..1]
+ referenceDoc: CI_Citation [0..*]
+ evaluationMethodType: DQ_EvaluationMethodTypeCode [0..1]
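The measure and evaluation method listings above can be mirrored in code to show how the pieces of a report fit together. The sketch below is an assumption-laden simplification: the Python class and field names are hypothetical, citations and identifiers are omitted, and results are reduced to plain strings. The three evaluation method type values are those of the DQ_EvaluationMethodTypeCode codelist.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

# Values of the DQ_EvaluationMethodTypeCode codelist.
EVALUATION_METHOD_TYPES = {"directInternal", "directExternal", "indirect"}


@dataclass
class MeasureReference:
    """Hypothetical mirror of DQ_MeasureReference (identifier omitted)."""
    name_of_measure: List[str] = field(default_factory=list)   # [0..*]
    measure_description: Optional[str] = None                  # [0..1]


@dataclass
class EvaluationMethod:
    """Hypothetical mirror of DQ_EvaluationMethod (citations omitted)."""
    date_time: List[datetime] = field(default_factory=list)    # [0..*]
    description: Optional[str] = None                          # [0..1]
    method_type: Optional[str] = None                          # [0..1]

    def __post_init__(self) -> None:
        # Reject values that are not in the codelist.
        if self.method_type is not None and self.method_type not in EVALUATION_METHOD_TYPES:
            raise ValueError(f"unknown evaluation method type: {self.method_type}")


@dataclass
class Report:
    """One quality report: a type, an optional measure and method, and results."""
    report_type: str                                           # e.g. "DQ_Completeness"
    measure: Optional[MeasureReference] = None
    method: Optional[EvaluationMethod] = None
    results: List[str] = field(default_factory=list)           # DQ_Result is [1..*]

    def is_valid(self) -> bool:
        """A report needs at least one result."""
        return len(self.results) >= 1


# Illustrative content only.
report = Report(
    report_type="DQ_Completeness",
    measure=MeasureReference(name_of_measure=["rate of missing items"]),
    method=EvaluationMethod(
        description="count of fill values per granule",
        method_type="directInternal",
    ),
    results=["2.5 percent of cells are missing"],
)
print(report.is_valid())  # True
```

Making the measure and method optional while requiring a result follows the cardinalities in the listings: a report can state a result without a formal measure, but a report with no result says nothing.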

Data Quality Results

The ISO Standard includes four types of data quality results in the DQ_Element object. All types include some basic information:

DQ_Result [1..*]
+ dateTime: DateTime [0..*]
+ resultScope: DQ_ScopeCode [0..1]

The first, the DQ_ConformanceResult, describes how the dataset was tested for conformance to a published standard and whether the dataset passed the test. The second, the DQ_QuantitativeResult, provides a mechanism for describing the results of a quantitative quality evaluation. The third, the QE_CoverageResult, allows a quality result to be expressed as a spatial object. For example, if a gridded dataset has an associated grid of quality flags, that quality grid could be described here. Note also that coverage results have an associated file and format. Finally, the DQ_DescriptiveResult provides a simple text description of the quality result.

DQ_ConformanceResult
+ specification : CI_Citation
+ explanation : CharacterString
+ pass : Boolean
Or
DQ_QuantitativeResult
+ valueType [0..1] : RecordType
+ valueUnit : UnitOfMeasure
+ errorStatistic [0..1] : CharacterString
+ value [1..*] : Record
Or
QE_CoverageResult (added in ISO 19115-2)
+ spatialRepresentationType: MD_SpatialRepresentationTypeCode, one of vector, grid, textTable, tin, stereoModel, video
+ resultFile: MX_DataFile
+ resultFormat: MD_Format
+ resultSpatialRepresentation: MD_SpatialRepresentation
+ resultContentDescription: MD_CoverageDescription
Or
DQ_DescriptiveResult
+ statement : CharacterString
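Because a report carries one of several alternative result types, code that reads quality metadata typically has to branch on the type. The sketch below shows one way to do that for three of the four types listed above; the coverage result is left out because it points to a file rather than holding a value. All Python names are hypothetical, citations are reduced to strings, and record values are reduced to floats.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class ConformanceResult:
    specification: str   # citation reduced to a title string
    explanation: str
    passed: bool         # "pass" is a reserved word in Python


@dataclass
class QuantitativeResult:
    values: List[float]                    # value [1..*]
    value_unit: str
    error_statistic: Optional[str] = None  # [0..1]


@dataclass
class DescriptiveResult:
    statement: str


Result = Union[ConformanceResult, QuantitativeResult, DescriptiveResult]


def summarize(result: Result) -> str:
    """Return a one-line, human-readable summary of a quality result."""
    if isinstance(result, ConformanceResult):
        verdict = "passes" if result.passed else "fails"
        return f"{verdict} {result.specification}: {result.explanation}"
    if isinstance(result, QuantitativeResult):
        label = result.error_statistic or "value"
        joined = ", ".join(str(v) for v in result.values)
        return f"{label} = {joined} {result.value_unit}"
    if isinstance(result, DescriptiveResult):
        return result.statement
    raise TypeError(f"unsupported result type: {type(result).__name__}")


# Illustrative values only.
print(summarize(ConformanceResult("CF-1.6", "all required attributes present", True)))
# passes CF-1.6: all required attributes present
print(summarize(QuantitativeResult([0.4], "K", "RMSE")))
# RMSE = 0.4 K
print(summarize(DescriptiveResult("Known striping in the first 100 scan lines.")))
# Known striping in the first 100 scan lines.
```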

DQ_ConformanceResult - Tracking Compliance With Standards

The importance of standards in data integration is well known and broadly acknowledged. It is also well known that compliance tests will be required so that developers can systematically measure compliance with standards that evolve with time. Of course, these tests can only be useful to users if the results are available and understandable.

There are two ISO Standard objects that apply to compliance reports. The first is the DQ_DataQuality object. This object associates a DQ_Element with a service, a dataset, or some subset of a dataset described by the DQ_Scope. The DQ_Element references a test and an evaluation procedure that produces a DQ_Result at a particular time. There are four types of DQ_Result, but the DQ_ConformanceResult is the type that is relevant here. It references the specification being tested, explains the meaning of conformance, and gives a Boolean result.

The second ISO object that could be related to conformance testing is the MD_Usage object. This provides a mechanism for incorporating user problems into the metadata for a dataset or service. The object includes a description of the specific usage, the problem encountered (the limitation), the date, and the user contact information. In the case of standards compliance, the usage would be an attempt to access the dataset or service in a way that is consistent with an advertised standard, and the limitation would be the finding that the dataset does not comply with the advertisement. These problem reports would be available from the archive and, hopefully, would motivate data providers to be careful about advertising services that they cannot support.

MD_Usage
+ specificUsage : CharacterString
+ usageDateTime [0..1] : DateTime
+ userDeterminedLimitations [0..1] : CharacterString
+ userContactInfo [1..*] : CI_ResponsibleParty
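As a sketch of how a non-compliance problem report could be captured with the fields listed above, the following example builds a usage record from a failed access attempt. The Usage class, the report_noncompliance helper, and all values are hypothetical; contacts are reduced to plain strings rather than CI_ResponsibleParty objects.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Usage:
    """Hypothetical mirror of the MD_Usage listing."""
    specific_usage: str
    user_contact_info: List[str]                       # [1..*]
    usage_date_time: Optional[datetime] = None         # [0..1]
    user_determined_limitations: Optional[str] = None  # [0..1]

    def __post_init__(self) -> None:
        # The listing requires at least one contact.
        if not self.user_contact_info:
            raise ValueError("at least one user contact is required")


def report_noncompliance(service: str, standard: str, problem: str,
                         contact: str, when: datetime) -> Usage:
    """Describe a failed attempt to use a service as its advertised standard implies."""
    return Usage(
        specific_usage=f"Accessing {service} as advertised via {standard}",
        user_contact_info=[contact],
        usage_date_time=when,
        user_determined_limitations=problem,
    )


# Illustrative values only.
problem_report = report_noncompliance(
    service="example gridded temperature service",
    standard="an advertised access standard",
    problem="Time axis is missing its units, so requests by date fail.",
    contact="data.user@example.org",
    when=datetime(2017, 1, 6, 14, 47),
)
print(problem_report.specific_usage)
```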

StandAloneQualityReport

ISO 19157 recognizes that important data quality information can exist outside of the conceptual framework of the model and that it may be helpful to provide that information as a supplement to the metadata. A DQ_StandaloneReportInformation class was added to enable connections between the metadata and these standAloneReports.

DQ_StandaloneReportInformation
+ reportReference: CI_Citation
+ abstract: CharacterString

How might these new types of reports be used? The environmental data community is using more and more standard formats for data files. The conformance test might indicate how well the dataset conforms to those file formats or associated conventions like the Climate Forecast conventions for netCDF.

Coverage Results

The quality of many datasets varies spatially within the dataset. For example, gridded satellite datasets include grids that provide quality flag values for every pixel. Radiosondes in the atmosphere and profiles in the ocean can also have quality that varies along their paths. In situations like these, the quality information can be described using a spatial feature such as a grid or a line. In ISO, quality information in spatial features is described using a coverageResult.

Coverage Result

The coverageResult includes MD_SpatialRepresentation and MD_CoverageDescriptions that describe the quality coverage. In this case, the MD_CoverageDescription can serve a role that is similar to the role of the measure and evaluationMethod elements in the report.
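To make the idea of a quality coverage concrete, the sketch below applies a grid of per-pixel quality flags to a data grid of the same shape. It only illustrates the kind of file a coverageResult would describe, not the metadata itself. The function names and the convention that a flag of 0 means "good" are assumptions; real products define their own flag meanings.

```python
from typing import List, Optional

GOOD = 0  # assumed flag value for a usable pixel; real products define their own


def mask_by_quality(data: List[List[float]],
                    flags: List[List[int]],
                    good: int = GOOD) -> List[List[Optional[float]]]:
    """Return a copy of data in which pixels not flagged as good become None."""
    if len(data) != len(flags) or any(len(d) != len(f) for d, f in zip(data, flags)):
        raise ValueError("data and flag grids must have the same shape")
    return [
        [value if flag == good else None for value, flag in zip(data_row, flag_row)]
        for data_row, flag_row in zip(data, flags)
    ]


def good_fraction(flags: List[List[int]], good: int = GOOD) -> float:
    """Fraction of pixels flagged as good."""
    cells = [flag for row in flags for flag in row]
    return sum(1 for flag in cells if flag == good) / len(cells)


# Illustrative 2 x 2 grids.
temperature = [[280.1, 281.4], [279.8, 282.0]]
quality = [[0, 1], [0, 0]]

print(mask_by_quality(temperature, quality))  # [[280.1, None], [279.8, 282.0]]
print(good_fraction(quality))                 # 0.75
```

A single number such as the good fraction could also be reported as a quantitative result, while the flag grid itself is what the coverage result points to.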

The MD_CoverageDescription includes a contentType that can currently be image, thematicRepresentation, or physicalMeasurement. The revision of 19115 may add qualityInformation to this codelist and allow multiple contentType values.