RDA Big Data/Analytics

From Earth Science Information Partners (ESIP)

Big Data Definitions

Gartner’s big data definition - “Big data” is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

Variety: companies are digging out amazing insights from text, locations or log files (multi-sensor data for science).

Velocity is the most misunderstood data characteristic: it is frequently equated to real-time analytics. Yet velocity is also about the rate of changes, about linking data sets that are coming with different speeds, and about bursts of activity (data fusion issues: time, space, resolution).

Volume is about the number of big data mentions in the press and social media.

Jim Frew:

My favorite definition is: You can't move it---if you want to use it, you have to go where it is (kind of like a pipe organ...)

For data generally, that means it has to be housed in a system that can do everything you'd want to do to it.

I.e. you send the problem to the data, not v.v.

For science data specifically, this means the data has to live in a reasonably complex processing environment.


Science Use Cases

Use Case 1: Event Analysis (Dr. Tom Clune, Dr. Kuo, GSFC/NASA)

An Earth Science event (ES event) is defined here as an episode of an Earth Science phenomenon (ES phenomenon). A cumulus cloud, a thunderstorm shower, a rogue wave, a tornado, an earthquake, a tsunami, a hurricane, or an El Niño are each an episode of a named ES phenomenon, and, from the small and insignificant to the large and potent, all are examples of ES events. An ES event has a finite duration and an associated geo-location as a function of time; it is therefore an entity in four-dimensional (4D) spatiotemporal space.

The interests of Earth scientists typically center on Earth Science phenomena with the potential to cause massive economic disruption or loss of life. But broader scientific curiosity also drives the study of phenomena that pose no immediate danger, such as land/sea breezes. Because of the Earth System’s intricate dynamics, we are continuously discovering novel ES phenomena.

We generally gain understanding of a given phenomenon by observing and studying individual events. This process usually begins by identifying the occurrences of these events. Once representative events are identified or found, we must locate associated observed or simulated data before commencing analysis and concerted studies of the phenomenon. Knowledge concerning the phenomenon can accumulate only after analysis has started. However, except for a few high-impact phenomena, such as tropical cyclones and tornadoes, finding events and locating associated data may currently take a prohibitive amount of time and effort on the part of an individual investigator. And even for these high-impact phenomena, the availability of comprehensive records is still only a recent development.
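
The definition of an ES event above (a finite duration plus a geo-location that varies with time) can be pictured as a small data structure. The following is only a minimal sketch: the class names `TrackPoint` and `ESEvent`, their fields, and the sample coordinates are hypothetical and do not come from the use case authors, and the vertical dimension is left out for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class TrackPoint:
    """Location of an event at one instant (hypothetical structure)."""
    time: datetime
    lat: float  # degrees north
    lon: float  # degrees east


@dataclass
class ESEvent:
    """One episode of a named ES phenomenon, e.g. a single hurricane."""
    phenomenon: str
    track: List[TrackPoint]  # geo-location as a function of time

    def __post_init__(self) -> None:
        if not self.track:
            raise ValueError("an event needs at least one track point")
        # Keep the track in time order so start and end are well defined.
        self.track = sorted(self.track, key=lambda p: p.time)

    @property
    def start(self) -> datetime:
        return self.track[0].time

    @property
    def end(self) -> datetime:
        return self.track[-1].time

    @property
    def duration(self) -> timedelta:
        return self.end - self.start

    def bounding_box(self) -> Tuple[float, float, float, float]:
        """(min_lat, min_lon, max_lat, max_lon); ignores the antimeridian."""
        lats = [p.lat for p in self.track]
        lons = [p.lon for p in self.track]
        return min(lats), min(lons), max(lats), max(lons)


# Made-up sample values, given out of time order on purpose.
event = ESEvent("hurricane", [
    TrackPoint(datetime(2020, 1, 2), 25.0, -80.0),
    TrackPoint(datetime(2020, 1, 1), 20.0, -75.0),
])
print(event.duration)        # 1 day, 0:00:00
print(event.bounding_box())  # (20.0, -80.0, 25.0, -75.0)
```

A time window and bounding box of this kind are the sort of spatiotemporal coordinates one would need when searching archives for data associated with an event.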


The lack of comprehensive records for most ES phenomena is mainly due to the perception that they do not pose an immediate and/or severe threat to life and property. Thus they are not consistently tracked, monitored, and catalogued. Many phenomena even lack commonly accepted defining criteria. Moreover, Earth Science observations and data have accumulated to a previously unfathomable volume; NASA's Earth Observing System Data and Information System (EOSDIS) alone archives several petabytes (PB) of satellite remote sensing data, and the archive is steadily growing. All of these factors contribute to the difficulty of methodically identifying events corresponding to a given phenomenon and significantly impede systematic investigations. In the following we present a couple of motivating scenarios that demonstrate the issues faced by Earth scientists studying ES phenomena.

Heat Wave

“Heat kills by taxing the human body beyond its abilities. In a normal year, about 175 Americans succumb to the demands of summer heat. Among the large continental family of natural hazards, only the cold of winter – not lightning, hurricanes, tornadoes, floods, or earthquakes – takes a greater toll.” – National Weather Service web site¹

Heat waves pose a serious public health threat, yet a standard definition of “heat wave” does not exist. Many researchers argue for a changing threshold in heat wave definition (Robinson 2001; Abaurrea et al. 2005, 2006). In hot and humid regions, physical, social, and cultural adaptations will require higher thresholds to ensure that only those events perceived as stressful are identified.

A researcher interested in understanding heat waves will first have to devote a great deal of time to research before associated, concurrent observations (ground-based or remote sensing) can be obtained and analyzed. A literature search for heat wave definitions will yield multiple conflicting definitions. Some decision on what qualifies as a heat wave would need to be made, followed by a search for the incidences that satisfy the definition.

Each investigator who is not collaborating or associated with other researchers of like interest must repeat a similar process. Intense and prominent episodes of heat waves are easier to find and not likely to raise questions of definition. Thus, a few intense cases are likely to be studied repeatedly with great scrutiny, while the less intense cases are largely ignored or never identified. An undesirable consequence is that emphasis on intense cases (or cases that impact populated regions) has been known to induce biases in past investigations of some phenomena. A systematic treatment of events could thus be an important mechanism for eliminating such bias in certain types of investigation.
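
To illustrate why the choice of definition matters, the sketch below searches one daily-maximum temperature series under two different definitions. It is only an illustration: the `HeatWaveDefinition` structure, the function name, the thresholds, and the temperatures are all hypothetical and are not taken from any of the cited definitions.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class HeatWaveDefinition:
    """One candidate definition; both fields are the investigator's choice."""
    threshold_c: float  # daily maximum temperature threshold, deg C
    min_days: int       # minimum number of consecutive days at or above it


def find_heat_waves(daily_max_c: Sequence[float],
                    definition: HeatWaveDefinition) -> List[Tuple[int, int]]:
    """Return inclusive (start_index, end_index) pairs of qualifying runs."""
    events = []
    run_start = None
    for i, temp in enumerate(daily_max_c):
        if temp >= definition.threshold_c:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= definition.min_days:
                events.append((run_start, i - 1))
            run_start = None
    # Close a run that extends to the end of the series.
    if (run_start is not None
            and len(daily_max_c) - run_start >= definition.min_days):
        events.append((run_start, len(daily_max_c) - 1))
    return events


# Made-up daily maxima for ten days.
temps = [30, 36, 37, 38, 31, 33, 34, 35, 36, 29]
strict = HeatWaveDefinition(threshold_c=35.0, min_days=3)
lenient = HeatWaveDefinition(threshold_c=33.0, min_days=3)
print(find_heat_waves(temps, strict))   # [(1, 3)]
print(find_heat_waves(temps, lenient))  # [(1, 3), (5, 8)]
```

The intense episode is found under both definitions, while the milder one appears only under the lower threshold, which mirrors the point that less intense cases may never be identified.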

Additionally, the criteria for heat waves should perhaps be defined depending on the purpose of the investigation. If the purpose concerns the physiological effect of excessive heat, aspects such as physical, social, and cultural adaptations should be taken into account. In contrast, for a study of heat waves as a purely physical phenomenon, latitude-dependent absolute thresholds would be more appropriate.

Blizzard

According to the National Weather Service glossary²: “A blizzard means that the following conditions are expected to prevail for a period of 3 hours or longer: 1) sustained wind or frequent gusts to 35 miles an hour or greater; and 2) considerable falling and/or blowing snow (i.e., reducing visibility frequently to less than 1/4 mile).” Consequently, blizzard is also an ES phenomenon that does not have an unambiguous definition, because both “considerable” and “frequently” are vague and not quantified. There is an opportunity to explore a more definitive specification.

The blizzard entry of Wikipedia³ provides a list of well-known historic blizzards in the United States as well as one instance in Iran in 1972, but is obviously far from complete, especially from a global perspective. Although reanalysis data may not contain a visibility field and usually have a temporal resolution coarser than 3 hours, one might still use the wind and falling snow criteria to find a collection of incidences that contains (i.e., is a superset of) the great majority of blizzard events. The spatiotemporal coordinates obtained through this mechanism can in turn aid in the discovery of concurrent, associated observations to determine the qualified cases of blizzards.
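
The superset search described above can be sketched for a single grid point as follows. This is a minimal sketch under stated assumptions, not the authors' method: the function name, the boolean `snowing` flag standing in for "falling snow", the use of a single wind value per sample as "sustained wind", and the sample data are all hypothetical. Only the 35 mph and 3 hour figures come from the quoted glossary definition.

```python
from typing import List, Sequence, Tuple

WIND_THRESHOLD_MPH = 35.0  # from the quoted NWS glossary definition
MIN_DURATION_HOURS = 3.0   # from the same definition


def blizzard_candidates(wind_mph: Sequence[float],
                        snowing: Sequence[bool],
                        step_hours: float) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) sample-index pairs of candidate runs.

    Each sample is assumed to stand for an interval of `step_hours`, so with
    data coarser than 3 hours a single qualifying sample already counts.
    This deliberately over-selects; visibility is not checked at all.
    """
    if len(wind_mph) != len(snowing):
        raise ValueError("wind and snow series must have the same length")
    candidates = []
    run_start = None
    n = len(wind_mph)
    for i in range(n + 1):
        # The extra iteration (i == n) closes a run that reaches the end.
        ok = i < n and wind_mph[i] >= WIND_THRESHOLD_MPH and snowing[i]
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            if (i - run_start) * step_hours >= MIN_DURATION_HOURS:
                candidates.append((run_start, i - 1))
            run_start = None
    return candidates


# Made-up 6-hourly samples at one grid point.
wind = [20, 40, 45, 38, 10, 36]
snow = [True, True, True, False, True, True]
print(blizzard_candidates(wind, snow, step_hours=6))  # [(1, 2), (5, 5)]
```

The returned index ranges, combined with the grid point's location, give the spatiotemporal coordinates that would then be used to look for concurrent observations and confirm or reject each candidate.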