IExperiment - Preservation Track

In many areas of science, the partial or complete automation of data processing and analysis has become a key enabler in increasing the efficiency of scientific investigation, ultimately paving the way for large-scale e-science. Increasingly, however, the e-science community is coming to realize that producing new information as a result of complex processing is not enough: a wealth of additional metadata, available as part of scientific data processing and generically referred to as "provenance", can later be used to help scientists make sense of, understand, explain, share and reproduce their results. In a broad sense, the term provenance refers to any metadata that describes the origins of data. In more detail, such metadata covers a surprisingly wide gamut of types, from the low-level technical details of service invocation and data transfer between services, to context information (the when, who, what, where, how), to user-generated annotations that provide a semantically rich complement to scientific datasets. From the scientists' perspective, keeping a trace of past activities, and of their results, is a well-established lab practice that typically takes the form of logbooks. As much of the lab activity is increasingly being automated, it is only natural that the same should happen to the management of provenance, the modern counterpart of the lab logbook. With automation come new opportunities as well as technical challenges; indeed, the study of the many forms, uses, and management issues associated with data and process provenance has been gaining momentum over the past few years, as a new discipline at the interface of science and computing, alongside the study of the main data and process management requirements and infrastructure for e-science.

The purpose of this talk is to review some of these issues and opportunities from the perspective of the myGrid research group in information management (http://www.mygrid.org.uk), which has a strong tradition of developing innovative technology infrastructure for e-science. Since 2001, the myGrid project has been delivering an open source software suite that includes the Taverna workflow management system (http://www.taverna.org.uk), the myExperiment social networking site for sharing workflows (http://www.myexperiment.org), and the BioCatalogue (http://www.biocatalogue.org) for cataloguing the world's Life Science web services. The myGrid effort has been largely practical: Taverna has been widely adopted by the academic research community, in more than 350 organisations around the world, including 30 US universities and 17 commercial organisations. In 2008 there were over 4000 active users. More recently, the myExperiment Web-based Virtual Research Environment has been developed to enable scientists to share, reuse and socially curate workflows in a trusted way, with full attribution and credit. Using Web 2.0 and social computing techniques, the public myExperiment.org web site has already gathered more than 1400 users worldwide, who share more than 600 scientific workflows, donated by their authors, from more than 10 different workflow management systems. The talk will provide an overview of these efforts, analyse how research and practice in provenance management build on these initiatives, provide an insight into some of the technical issues, and outline our vision for provenance-aware e-science applications.