Notes from Collaborative Strategies for Sustained Environmental Data Management workshop (Tempe, AZ Nov 2015)

From Federation of Earth Science Information Partners


The following notes were taken by scribe Margaret O'Brien during the discussions at the 2.5-day workshop in Tempe, November 2015.

Notes from 2015-11-17 discussions

Margaret Hedstrom - also has a project with SEAD, ICPSR

Steve Daley-Laursen: studies how orgs come together to collab?

Bob Downs, Columbia, CIESIN

Peter McCartney: why is less funding a high bar, and ___ a low bar?

Cyndy: interesting exercise, how would you do this if funding was not an issue.

highest cost: people area: transferring data from researcher to repository


Rick Hooper: what is the metric for success? is it reuse? the rule that says you can’t submit an annual report till you submit data is an OMB directive. but does it help the science?

taxpayer POV how to extract the most value from data -- max value initially, plus reuse.

Margaret Hedstrom: incentives: expectation that DMPs are sufficient motivation to get researchers to archive data. DMPs are the hammer, but we have no data on whether this is effective

How does a library service (eg, inst. repository) help researchers advance their work? an issue: sharing does not necessarily help research.

primary motivation: publishing a paper. so writing metadata is a dead weight. cost associated.

one possible incentive: documented reuse

we shift the cost of work (metadata creation) back and forth (between repo and researcher). but until we recognize that data documentation is a workflow.

ethics: if your research is not reproducible, are you actually doing science? no.

conclusion: if data were not documented to the same level as a paper, is it published? no. defining what the product is.

definitions for process of “posting” vs “sharing” - eg, posting is easy, sharing requires more work. (“sharing” is the wrong word, alone it has no concept of documentation.)

quick survey question from Phil Tarrant: publications are part of every tenure conversation. are data? (no)

note: the likelihood of data citations affecting tenure decision is debatable. People who produce data products are not all tenure-types.

Sherry Lake’s highest priorities:

  • create ways to capture metadata
  • get researchers early in cycle to think about archiving and sharing
  • train the trainer (librarians) - how to manage data, where to deposit
  • participate in SHARE? (is this an org, or project?)

These are mostly human-related.

scientist-student relationship is SOP. eg, field notes, lab notebook. we’re not all clueless, but tech is changing. we have to figure out how to integrate in a univ env with the prof-student relationship. everything from the software (profs have no concept of software mgt) formalize training for students. how to entrain profs and retrain. we are starting somewhere (200yr tradition) - maybe inadequate in a digital era, we can do more things. potential to do more with it.

note: this discussion is about culture, not precisely where we want this to go.

Carolina Murcia: procedures manuals in a company help continuity. need those for data collection, methodology as well. change how

Cyndy Parr: ethics training: put data mgt into the univ’s ethics training. responsible conduct for research training. (Sherry has done the DM part at UVa already, live)

Peter McCartney: tension between univ and domain repos. univs are finally starting to invest in these. given that there is a cost: the more that is borne at the inst (OH rates, etc), the better. repos in domains are competing with research dollars, so harder for NSF to justify. [OH still coming out of research dollars, but is proportional.]

Matt Jones - how much and what kind of metadata is enough? it will vary. part of the tension. part of answer is “what do you want to do with it”

define usage and each one’s to-do:

  • discover: link federations: d1, SHARE, GEOSS
  • interpret:
  • machine processing: (at NCEAS, a postdoc 1-2 years/20 datasets)

Jones’ highest priorities:

  • some metadata better than none
  • complete, machine readability
  • versioning and ids (different mental models of data objects, and how they are treated/updated)
  • prov and semantics added to metadata (text metadata on methods not enough)
  • shared PIs

might see some consolidation, but will continue to be diverse - have to accommodate legacy practices. groups move at different speeds.


Ruth Duerr: data formats: unstructured vs structured

unstructured: allows flexibility, including anything; inhibits aggregation

defined: limits capture of new info (that doesn’t fit the model) facilitates reuse (by a domain, but often because model builders introduce domain-specs)

when is predefined format worth it: when community recognizes that they get more from it than it costs.

priorities for discovery:

  • define audience (who is the data for), what is their world-view
  • invest in stds that work for that community
  • facilitate processing by that community
  • ensure people with broad understanding (of stds) facilitate communities to prevent them reinventing the wheel


Wade Sheldon: the combination of requirement to share + rigid formats is dangerous. forcing info into the wrong fields, etc.


Cyndy Chandler:

  • curation that can be kept close to point of acquisition
  • semantic markup
  • connect instead of copy, except for archive copy (canonical)
  • prov tracking (this could be a way to connect ‘copies’ - this copy is derived from that)
  • priorities - what should be archived (need archivists)

other bits to keep in mind: identifiers (dois) don’t mean data is usable. -- (something else) quality and accuracy will emerge as barriers as automated usage increases. trust breaks down fast

copies: libraries have this attitude that copies are good - ensure preservation. but we (and researchers) like the notion of a reference version (authoritative source)

Margaret Hedstrom: level of granularity: where do you assign an ID? what is the level to attach metadata? individual data value? file? collection? same with prov (differs for community)

2. You can’t have everything. we’ve never saved everything. we might have to change the default condition from we want everything. we want the exceptional


Bob Downs: broad vs selective repositories: researchers don’t know the difference.

can’t avoid the fact that we have to set priorities. eg, we don’t know what’s valuable yet.


Tables

Red:

we don’t have good info about the connection between repositories, data and users. need sustained funding. repos could offer services for pay (at deposit), menus of choices.

comments:

Margaret Hedstrom: costs often come from research dollars. the cost needs to be known.

Peter McCartney: eg, repo might cost 5 million, must break up this cost into infrastructure, specialized interfaces?

MH: no, a proportion of research funds now goes to data (mgt, including contribs to repos). want to determine that.

Greg Janee: could common infrastructure reduce costs? how? what is scale? feds use contractors. their profit motive adds complexity. (part of NOAA’s mission is to support industry. it’s in Commerce)


Blue:

see their notes in this dir: 20151117_phoenix_blue_table

repos effect a culture change, and are regarded similarly to publishers. promote data, add value and services (viz, annotation, review, metrics, prov)

comments: what’s missing in systems today? (refer to notes) did you talk about other communities besides scientists? eg, users?

Glyn Thomas: are you not (you as data managers) part of science?

Cynthia Parr: eval’d differently than researchers.

Cyndy: we mostly are not “practicing scientists”, but are embedded in


Yellow:

value added by the repo. repos are not a sink, they help the sci. process.

comments:

Yaxing Wei: what is timeline? seems longer than 5-8 years.

C. Parr: But some of this is already possible - LTER does the subscribe-rerun

Phil Tarrant: if we got our act together, we could accomplish quite a lot.

_____: what can repo provide to the contributor? (metrics of future use. easy getting things in)

what are limitations, how to get there from today: culture shift, trust (repos are trusted)


Green:

designate a community, global scale. expand that community. limitations: silo’d repos (agency, domain, publishers). current org’s workflows that create a certain type of infrastructure that don’t match scientists activities. lack of incentives for scientists, lack of understanding of user communities.

redefine as interop services, rather than whole stacks. run independently. scale, evolve. (different groups do storage, QA, viz, etc)

repos should become selective on what they (curate). need clear descriptions of how groups decide

training: recognize differences from publishing (where does the metaphor fail?)

3-yr grants absolutely inadequate. funding needs to be based on stability, sustainability, not novelty and experimentation.

flip side of interop services is their dependencies. How to mitigate this? aside from stable funding. move among multiple service providers. eg, not being implementation-driven (eg, all based on x, so


General comments:

Bob Downs: each table presented a different facet. to form a whole, one could combine them.

Sherry Lake: environmental data is from more than one science. we have not touched on inclusion of other types - eg social science.

Carolina Murcia - [did not hear ].

Ken Casey: his org, NOAA, is already in a reorg. some similar discussions in that group.

Phil Tarrant: common problems,

Wilson: after this workshop, then what? will there be a report? what is follow on? report, synthesis of survey data, you might need to create an entity, to get further… part of tomorrow’s agenda is to talk about where this group goes next.

Margaret Hedstrom: an entity? we already have many entities. is there an existing one that can be the platform?

Peter McCartney: Do all the existing entities cover everything we need? and all we have to do is inform the community?

Matt Jones: having an entity doesn’t solve the problem.

Philip Tarrant: everyone has the same problem. great output from this group would be to identify the ‘bites’ - what are the small bits to eat first.

Matt Jones: report? there have been lots of reports. 4-6 that articulate these issues. been articulated many times.

Philip Tarrant: real output is that we have a way forward

Matt Jones: still dependent on funding. structural barriers to collaboration. funding at level of individual institutes, grants are short-term. programs at NSF don’t even last. we don’t want to repeat this again in 3-5 years.

Peter McCartney: those were tech-driven projects. not community driven. funded this activity to find where the gaps are - what do we need to fill the data management need for a significant research community. this sort of dialog needs to be more than just a workshop.

_____: some groups codify [what repo will do] into policies. can we talk about that?

Glyn Thomas: not on the agenda now.

_____: 20m project, data moving into a new repo (theirs is ending), but researchers want their old policy supported. cannot do it.

Glyn Thomas: responding to Peter: IT is just hardware, getting people to change to use it is a very different task.

Margaret Hedstrom: to Peter: GEO says it won’t fund data curation, NSF won’t fund long term archiving.

Peter McCartney: one major project with data curation: 40 years, by NSF for 20. justified by unambiguous msg that this is a critical resource. track called sustaining track: provide means for operations (not building stuff). terms: only can ask for ongoing costs, as determined by value perceived by NSF.

The Future of Environmental Repositories

Questions from organizers: If data repositories were managed in the most effective and efficient manner, what results might be enabled within the sectors in which this data is used?

Given the presentations and conversations thus far, what are the three greatest limiting factors to pursuing the results you described in the previous question?

What would need to change in the way in which environmental data repositories are managed in order to overcome these limiting factors?

In order to engender the level of collaboration necessary to be successful, what changes would need to be made in the way in which repositories conduct their work?


The Future of Environmental Data Repositories

SLIDE SUMMARIES

Table Group Color: Yellow

Our Vision:

In the future, data repositories operate in a world where researchers, informed by outreach or helpdesk support, know where to deposit their data and have tools and assistance to richly describe and format their data, workflows, and code easily as they submit to a trusted, certified repository. Depositing new versions or derivatives is straightforward. They do so because they know that their contribution will be properly acknowledged and provides benefits to them. The data submission policy is aligned with community needs. Another researcher is able to efficiently find large amounts of relevant cross-disciplinary data of sufficient quality for their innovative inquiry. They may even be guided to new insights and ideas just from the data. They can choose their own platform for the search, and it won’t matter what words they use in the query because repositories have collaborated on semantics and standards. Google even has a “Data” button to the left of “Images.” In fact, their analysis is alive -- the system helps focus attention on new data that would be important for testing the hypothesis. Data consumers and reviewers can track provenance and painlessly provide feedback, so potential bias or malicious intent or poor quality are minimized. Documented retention schedules ensure high quality data are easy to find. Data-intensive research is accessible to those who have limited resources – there is a level playing field. The repository has sufficient resources to do its job well because its funders have access to metrics that demonstrate the impact of their investment in the research.


Table Group Color: Blue

Our Vision is that a consolidated set of active, authoritative repositories would provide economies of scale and add value to the science community by providing seamless acquisition, access and use of environmental data. This shift in repository management would effect a culture change by:

  • Promoting data to a first class citizen
  • Providing mechanisms to give credit (formal citations; encourage citation and give altmetrics - downloads) and report
  • Adding value to the data (analysis environments and features like annotations, reviews, metrics, provenance, integrity)
  • Educating science community
  • Engaging early career researchers

Notes: bit.ly/20151117_phoenix

Table Group Color: Red

Our Vision:

We envision repositories expanding scientific inquiry in time and space, enabling policy making and creating opportunities for comparative and multidisciplinary research by designing systems that are based on users’ observations and feedback, understanding the value of repositories and being open to change in standards, services, and tools. The data are inherently heterogeneous, the research goals are organically changing, and the mapping between data and user needs, and repositories’ capacity to address those needs, is lacking. The overall system will be a distributed network of financially viable specialized repositories that share their activities, services, and capacities and have established menus of services and cost so that the users know what to pay for.


Table Group Color: Green

Our Vision:

We envision an interoperable, collaborative network of sustainable data repositories that supports open science for a designated earth and environmental science community over the next 5 - 10 years at global scales. This network will allow researchers to discover, access, and re-use data from other researchers, and will grow to support other designated communities, including decision-makers, policy makers, teachers, and students. Researchers will be able to spend much more time doing science, and a lot less ‘wrangling’ data in pursuit of synthetic and collaborative studies.


Limitations:

1) The current governance and funding structures for repositories (government agency-based, domain-based, institution, and publisher) inhibit sustainable cooperation and interoperability. Different types of repositories compete for limited funds, build and customize their own software, and create redundant services. Government funding for research is short term and does not support development of sustainable infrastructure or long-term preservation. Infrastructure requirements are defined by available funding and goals of benefactors rather than scientific users.

2) The current organization of repositories, workflow, and the “data publication” model creates and/or reproduces disciplinary silos, institutional silos, and other divisions that make discovery and reuse overly dependent on repositories rather than scientific need. The data publication metaphor/model may be inadequate. Although published snapshots are useful for fixed amounts of authoritative data associated with a published paper, they don’t capture the full granularity of the data, and do not match the volume, heterogeneity, and recombination possibilities of data.

3) The relationships between data producers, consumers, and repositories are compromised by the lack of participation from data producers, and a mismatch of incentives between producers/consumers. Repositories don’t have a sufficient understanding of the re-use community.

Approaches: To overcome these limiting factors, repositories should be redefined as a set of interoperable services, not end-to-end stacks (e.g., data storage, metadata authoring and management, replication services), and in so doing, provide a set of services that can be swapped out, interchanged, scaled to new levels, and based on an evolving suite of standards. Repositories should become selective of the data to be curated for preservation. We must increase the number of data managers and curators in repositories by two orders of magnitude and ensure that every scientist who produces data is proficient at managing data in her own discipline. Repositories need to recognize the granularity, identification, and provenance issues inherent in data that differ from the traditional publication model.

Enabling Collaboration: In order to engender successful collaboration among repositories, funders must develop new models that support long-term funding of repositories and repository services, and that evaluate infrastructure based on stability and sustainability rather than novelty and experimentation. Repositories need consistency in data review criteria to justify initial deposit and sustained (re-)use of research data throughout the research cycle, with far greater efficiency than today.

RED

<https://docs.google.com/document/d/1_gdhw3T_z08pqxxYvsgujIttWK80NUbqCufDT4JrKnA/edit?ts=564ba3d3#>

  • Anne Wilson (coordinator)
  • Richard Hooper
  • Ruth Duerr
  • Wade Sheldon
  • Nancy Hoebelheinrich
  • Inna Kouper
  • Peter McCartney
  • Corinna Gries


Notes:

Where are we now?

  • Many different repositories, but lack of integration
  • Repositories are of different quality, but do we have enough repositories, too many, not enough of certain type?
  • Long-tail users - where would they put their data?
  • NSF funding requests need to connect the need to preserve data with the value of that data

Assumptions:

  • different repos with varying reputations, e.g. some people want to put their data in certain repositories, because data are better curated, better quality, etc.
  • data documentation, citation and attribution is the norm and is part of professional recognition
  • publication requires supporting data
  • researchers are motivated and incentivized to deposit their data
  • change is happening, although at different rates

Why is it important to have well-documented data? Is well-documented data the end result? We need to be explicit about the reasons for doing data management.

  • Reasons for doing this - scientific reproducibility, decision-making based on real data (enabling science and policies).
  • Data needs to be preserved so that observations are recorded going back in time, at multiple locations, etc.
  • The essence of environmental science - multidisciplinary, needs access to data from various disciplines.
  • Is data that is in repositories useful for re-use? E.g., right granularity, right scale, etc. We need analyses of use, metadata sufficiency, etc.


Questions from organizers:
1. If data repositories were managed in the most effective and efficient manner, what results might be enabled within the sectors in which this data is used?

  • expanding temporal and spatial scope of scientific inquiry and enabling comparative and multidisciplinary inquiries
  • enabling policy- and decision-making that uses evidence from scientific observations and modeling
  • tools for evaluation of data suitability for re-use

2. Given the presentations and conversations thus far, what are the three greatest limiting factors to pursuing the results you described in the previous question?

  • systems are designed based on assumptions that are not based on observations of actual users (researchers)
  • research goals are organic and it’s hard to rely on them in repository design and management
  • curation is expensive and it is not clear which uses justify what costs
  • value of repositories and services change
  • lack of educational programs for curators and data managers (e.g., in library and information science)
  • users, data managers, repositories - no good connection between the three

3. What would need to change in the way in which environmental data repositories are managed in order to overcome these limiting factors?

  • better ways to align user needs, repository metrics, and data managers’ capacities

4. In order to engender the level of collaboration necessary to be successful, what changes would need to be made in the way in which repositories conduct their work?

  • incorporate feedback from providers, mediators, users, etc.
  • be open to change in standards, interfaces, tools, etc.
  • become attentive to audiences of data (designated communities) that change too
  • develop and accept shared best practices in data documentation and deposit (similar to the journal model that has author guidelines, reviewers, licensing agreements)
  • establish better education and outreach
  • support each other through sharing of experiences, activities, capacities, and services for the purposes of creating a coherent network of specialized repositories and distributing the overall burden of curating and sharing data
  • conduct a study of repository value and return on investment
  • break down levels of services and cost for them (e.g., cost of keeping the system, curation, tools development)

Vision:

  • Communities and their practices are monitored over time with assessment tools and ways to address their needs are changed based on that feedback
  • Repository identifies when new products are needed based on researcher needs
  • Repository is driven by research needs, questions, use cases
  • The repository is financially viable and stable; the expectations of data longevity are matched to the funding model
  • Functionality and value changes over time, which affects types of services and forms of assessments
  • Different repositories specialize in different subjects and optimize their expertise, there is an interface to the scientists (SEAD tool - Matchmaker)
  • Repositories have menus of services and cost - users pay or do it themselves

GREEN

  • Matt Jones
  • Erin Clary
  • Steven Daley-Laursen
  • Bob Downs
  • Paul Gessler
  • Margaret Hedstrom
  • Dave Rugg

<https://docs.google.com/document/d/1rAScFKHcSuCTfWdfIl9dyibhpLlA1NMhG9adnMbDtes/edit>

http://bit.ly/1MA4ZNJ

Notes: Integration and collaboration of data repositories over a temporal scope - 5 - 10 years, spatial scope of global? What results enabled?

  • People are able to (re-)use research data for any stage of the research cycle - not just discovery, but exploration and integration, etc.
  • Re-use of data outside of research cycle (e.g., decision-makers, policy makers, teachers in schools)
  • Serving a designated community and perhaps beyond the designated community
  • serving beyond the primary designated community may be further out - first focus should be on a designated community (community of individuals w/ shared/common knowledge).
  • Primary designated community: Earth and environmental science researchers
  • Enable open science
  • Reproducibility, open methods, interoperability
  • Facilitate data-intensive science
  • Repository/institution disappear into the background
  • Sustainable infrastructure
  • Scientists spend most time doing science, and a lot less ‘wrangling’ data

Three greatest limiting factors:

Current Governance and Funding Structures

  • Funding links to institutions inhibit sustainable cooperation/interoperability
  • Programs @NSF and elsewhere are ephemeral, no support for long-term infrastructure
  • Infrastructure dependent on benefactor rather than users

The current governance and funding structures for repositories (government agency-based, domain-based, institution, and publisher) inhibit sustainable cooperation/interoperability. Different types of repositories compete for limited funds, build and customize their own software, and create redundant services. NSF, NIH and other government funding for research is short term and does not support development of sustainable infrastructure or long-term preservation. Infrastructure requirements are defined by available funding and goals of benefactors rather than scientific users.

Organization of repositories, workflow, and “data publication” model

  • Repositories mirror disciplinary silos, institutional silos, etc.
  • Data publication metaphor/model may be inadequate
  • Published snapshots are useful for fixed amounts of authoritative data associated with a published paper, but don’t capture the full granularity of the data, and don’t match the volume, heterogeneity, and recombination possibilities of data

Relationships between data producers, consumers, and repositories

  • Lack of participation from data producers / incentives mismatch between producers/consumers
  • Repositories don’t understand the re-use community
  • Unclear how people locate data
  • Unclear how data are re-used, why and how
  • Data heterogeneity necessitates varied management approaches
  • Redundant infrastructure development
  • Legacy investment in built infrastructure inhibits interoperability, slows change
  • Agency view that they are legally mandated to be the authoritative source for federally generated data

Changes needed in way env repos are managed to overcome limitations

  • Repositories should be redefined as a set of interoperable services (not end-to-end stacks); (e.g., data storage, metadata authoring and management, replication services); repositories provide a set of services that can be swapped out, interchanged, scaled to new levels, based on an evolving suite of standards
  • Repositories need to become selective of the data to be curated for preservation
  • Need to increase the number of data managers and curators in repositories by two orders of magnitude; and we need every scientist who produces data to be proficient at managing data in her own discipline (scientific metadata standards, analytical tools, credibility and integrity, etc.) from advisors/mentors through to graduate and undergraduate students
  • Need incentives for producers to participate/contribute; specific value
  • Repositories need to recognize the granularity, identification, and provenance issues inherent in data that differ from the traditional publication model

To engender collaboration, what changes are needed in how repositories conduct their work?

  • Long-term funding of repositories and repository services
  • Funders need to evaluate infrastructure based on stability and sustainability, rather than novelty and experimentation (distinguish between infrastructure and research on infrastructure)
  • Initiate consistency in review criteria for data to justify initial and continuing curation investments (taxonomy of review criteria)

BLUE

  • Paul Hanson
  • Greg Janee
  • Sherry Lake
  • Erin Robinson
  • Dave Vieglais
  • Yaxing Wei

<https://docs.google.com/document/d/1ilDu29fOiw-c1ObtjmHXKJL8i6MmCYPfDCJz2ArnjMU/edit>

Our Vision is that a consolidated set of active, authoritative repositories would provide economies of scale and add value to the science community by providing seamless acquisition, access and use of environmental data. This shift in repository management would effect a culture change by:

  • Promoting data to a first class citizen
  • Providing mechanisms to give credit (formal citations; encourage citation and give altmetrics - downloads) and report
  • Adding value to the data (analysis environments and features like annotations, reviews, metrics, provenance, integrity)
  • Educating science community
  • Engaging early career researchers

More notes: bit.ly/20151117_phoenix


Notes:

transparency, and lower barriers to participation through

If data repositories were managed in the most effective and efficient manner, what results might be enabled within the sectors in which this data is used?

  • Provide economies of scale
  • Value of data (culture change) at level of paper - 1st class citizen
  • Citations (which would be increased in our effective and efficient - reliable repository) to datasets used for promotion and tenure
  • Data able to be reused - benefit to creator
  • Repeatable science
  • Quality check/improvement of the data
  • Marketplace for data
  • Users will not only be able to find data products meeting their needs, but will also be recommended the “best” data products for them.
  • Enable users to provide feedback on data (annotating the utility of data for particular purposes), thumbs up/down for data products; this feedback will be properly managed by data repositories and shared back to the user community
  • Data repository provides value-added layer context (provide analysis environments) - integrity in tracking; what you get out is what went in.
  • Provenance tracking
  • Transparent, seamless - people want to provide and retrieve data from repositories
  • Active Repositories as analysis environments with integrated tools, services - more value added for the contributor / user
  • Make repositories more accessible for early researchers. Lower the barrier for early career scientists to clearly see the available data resources in any specific science domain (in the past and present); this will help them to build ideas to move science forward more easily
  • Federated repositories that people can add to one repository and find across or find through repository of choice
  • Seamless use of the repository by users (contributors and consumers)


Given the presentations and conversations thus far, what are the three greatest limiting factors to pursuing the results you described in the previous question?

  • Research incentives not clear expectations
  • Repositories are passive
  • Technical - No federation of repositories
  • Culture - professionals can add only so many dimensions to their work; not willing to change the way they do their research; researchers generate data for themselves


  • Identifying value added benefits to users will provide the greatest benefits to users
  • Making contributors more active in the archive / repository process (increasing awareness of the need to store the data in a way that facilitates re-use)
  • Integration of repositories into a common framework is a difficult process since there are technical and operational aspects that need to be aligned
  • Culture problem - trust of someone using, credit, designing research that leverages other data, skills barrier (professionals can add only so many dimensions to their work); build into workflow

What would need to change in the way in which environmental data repositories are managed in order to overcome these limiting factors?

  • Active repository with value add; broaden the target that the data can be used for.
  • Consolidate repositories

In order to engender the level of collaboration necessary to be successful, what changes would need to be made in the way in which repositories conduct their work?

  • Educate the science community
  • Engage researchers to identify value (new papers, better collaborations, exposure, shareable among themselves)
  • Engage early career scientists (new ways of business)
  • Give credit; formal citations; encourage citation and give altmetrics (downloads) and report

YELLOW

  • Cyndy Parr
  • Kevin Browne
  • Cyndy Chandler
  • Carolina Murcia
  • Ken Casey
  • Mark Servilla

Yellow Vision Slide

Our Vision:

In the future, data repositories operate in a world where a researcher, informed by outreach or helpdesk support, knows where to deposit their data and has tools and assistance to richly describe and format their data, workflows, and code easily as they submit to a trusted, certified repository. Depositing new versions or derivatives is straightforward. They do so because they know that their contribution will be properly acknowledged and provides benefits to them. The data submission policy is aligned with community needs. Another researcher is able to efficiently find large amounts of relevant cross-disciplinary data of sufficient quality for their innovative inquiry. They may even be guided to new insights and ideas just from the data. They can choose their own platform for the search, and it won’t matter what words they use in the query because repositories have collaborated on semantics and standards. Google even has a “Data” button to the left of “Images.” In fact, their analysis is alive -- the system helps focus attention on new data that would be important for testing the hypothesis. Data consumers and reviewers can track provenance and painlessly provide feedback, so potential bias or malicious intent or poor quality are minimized. Documented retention schedules ensure high quality data are easy to find. Data-intensive research is accessible to those who have limited resources – there is a level playing field. The repository has sufficient resources to do its job well because its funders have access to metrics that demonstrate the impact of their investment in the research.

Notes

1. results we want

-- data are preserved
-- researchers can bring their ideas to the platform and test them with existing or even newly arrived data
-- can both discover available data and use the data no matter where it lives
-- more efficient than now, e.g. more interoperability
-- greater reproducibility of scientific results (for more papers there are data available to reproduce). A corollary is that new data can more easily refine or refute old ideas
-- more long term larger scale research is possible
-- workflows (e.g. scripts) are also made available with the data (e.g. LTER ecosystem indices already do this)
-- enables larger, more transparent collaborations that help reduce individual bias in data collection or selection
-- repository recommends new data sources that might be suitable "You may also be interested"
-- level-playing field so individuals can also do large scale research regardless of their computing resource and SHARE their large-scale data products
-- have access to HPC
-- users can know the quality of data and feedback from other users
-- the repository could work with the analysis capability
-- the layer of discovery/visualization may or may not live in the repository, but most likely computation can be moved to the data
-- ultimately, infrastructure exists to help researchers discover "ideas" from the data
-- Google has a "data" button

2. Barriers:
-- Lack of metadata
-- Quality is hard to determine, especially for different uses
-- Can't expect future versions of data to have semantic continuity
-- Based on cultural norms, many scientists don't understand that their data have more value when put into a larger context, and are not willing to meet a high bar themselves
-- Tracing provenance of derived data is difficult, especially if it has to be manually derived
-- Difficult to detect duplication
-- Too much data, which may or may not be relevant: what do you do?


3.
- Submission requirements are aligned with community needs
- Incentives: including that the repository ensures that derivatives deliver impact points to original data
- There is facilitation of data deposits, both human and machine
- Monitoring compliance, both to repository standards and external mandates
- Deduplication (Intelligent compression)
- Intelligent filtering

4.
- Finding reviewers of metadata and/or policies will require trust and transparency
- More repositories become certified as trusted repositories
- Repositories collaborate to ensure that it is less important exactly which venue you use to deposit or find your data
-- Agree on standards
-- have a governance structure to support the work
-- jointly manage ask-a-data-manager virtual helpdesk
-- help produce general outreach material & curricula and make sure it gets used by researchers as early as possible in the process
-- change the social paradigm so that having usable data in the system is cool and worth the effort (prestige thing)
-- tools for stakeholders/funders to see the metrics on the data in the repository, to show them that their money is being well used. Accountability dashboard.
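
For the "difficult to detect duplication" barrier and the deduplication item above, the simplest case (byte-identical copies held under different identifiers) can be sketched with content checksums. This is only an illustration: the identifiers and sample data are made up, and a checksum comparison will not catch copies that were reformatted or re-exported.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def content_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a data object's bytes."""
    return hashlib.sha256(data).hexdigest()


def find_duplicates(holdings: Dict[str, bytes]) -> List[List[str]]:
    """Group identifiers whose content is byte-identical.

    `holdings` maps a (hypothetical) dataset identifier to its raw bytes.
    Only groups with more than one identifier are returned.
    """
    by_digest = defaultdict(list)
    for identifier, payload in holdings.items():
        by_digest[content_digest(payload)].append(identifier)
    return [sorted(ids) for ids in by_digest.values() if len(ids) > 1]


# Made-up holdings: two repositories hold the same file, plus a revised version.
holdings = {
    "repoA:lake-temps-v1": b"date,temp_c\n2015-11-17,12.4\n",
    "repoB:lake-temps-copy": b"date,temp_c\n2015-11-17,12.4\n",
    "repoA:lake-temps-v2": b"date,temp_c\n2015-11-17,12.5\n",
}

duplicates = find_duplicates(holdings)
print(duplicates)
```

Detecting near-duplicates (same data, different format or metadata) would need the provenance tracking discussed in the barriers list rather than checksums alone.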

————————

Notes from 2015-18-11 discussion

scribe: Margaret O’Brien

ESIP - Erin Robinson
loose> formal: cluster, working group, committees (clusters most similar to IMC working groups)

ESIP backbone (based on: http://ssir.org/articles/entry/collective_impact)

tool match: http://wiki.esipfed.org/index.php/ToolMatch
esip governance examples seem to be best practice guidelines?


Governance/IEDA - Kerstin Lehnert
IEDA = merged group of 12 data systems at Lamont under NSF CI. (5 years on?) - IEDA builds joint services, some level of interop, to enhance their domain repos. groups in original 12 are aligned into 2 broad groups: sample-based or sensor-based.

see COPDESS (requirements of publishers, for data to go with papers)

builds infrastructure for interop stacks. repos are the “alliance partners”. they become part of the coop agreement. so it is now contractual.

now adding partners with EarthCube funding. = “alliance test bed” ATP, help new groups to plug in. brought in a social science group to help with organizational stuff (facilitate, write charter, roles and responsibilities of indiv partners). new orgs challenges: social and org engineering, diversity of data needs, diversity of systems, business models

Glyn calls these use cases of the early part of self-organization, on the way to contractual agreements - maturation.

are these “the same interests” evolving in slightly different ways? or are they really different? what are the common? (missions overlap, but not completely). our overlaps are partial, so even the ways we overlap are on a continuum.

Greg: tech/infrastructure. is it a common repo? or a federator? [more like a federator] what does ieda offer? [they move RDBs to IEDA servers], professional hosting, backup. so they have templates for discipline-specific metadata requirements, they have ported data models, etc. but still hard to tell what IEDA actually does.

apodopsis data portal(?) nsf funded, came from a community workshop, built on existing infrastructure, etc., some descoping. (this still sounds like a domain repo, though). ? what is PDB? infrastructure wants to be risk-averse. risk is not in tech, it’s in coordination, organization.

what NSF wants is “infrastructure that makes data available at no more than incremental cost” nsf won't say what the infrastructure should be, but we can’t make data available if an infrastructure doesn't exist. maybe a blueprint that the community endorses? but ‘if you build it, we will use it’ not enough.

other agencies represented here: cross agency stuff is even harder. at the earth/env science level. we should be working on the common problems, our goal is not to influence NSF. but funding boundaries become very firm walls.

Anne Wilson: ESIP data study: get NRC to develop a unifying vision that could work across agencies.

NODC is 50 years old, and has NSF-funded ocean data. was then even operating under the assumption that repos could be collaborative.

AFTERNOON

Intro to interest group breakouts: where do we start?

————


Landscape Analysis and Gaps for Environmental Data Repositories

https://docs.google.com/document/d/1rYs3nOku5oCZCYcemNn-1dwJ4pahP8fCpHFD0YVFlHg/edit#

URL: http://bit.ly/env-repo-landscape

  • Peter McCartney
  • Cyndy Chandler
  • Sherry Lake
  • Yaxing Wei
  • Inna Kouper
  • Erin Clary
  • Kerstin Lehnert
  • Matt Jones
  • Corinna Gries

Slides:
We define this arena as environmental data repositories in the US that hold natural and social science research data. Analysis must be done of repositories that are missing, strengths of existing repositories, as well as connections between those repositories and approaches to cross-repository shared infrastructure/ services/ tools. It is critical to help repositories improve their services, and to help users to decide what repositories match their needs based on their requirements.

Top 3-4 outcomes and timelines:
  1. a survey of capabilities of existing systems (start with re3data, fix/correct the data, add missing systems)
  2. analysis of the re3data catalog for overlaps and coverage gaps
  3. recommendation document for how to improve cross-repository collaboration and shared infrastructure

Outcome 3 will be done by convening a ‘coalition of the willing’: repositories who want to invest in interoperability. Recommendations should be tiered on what can be done 1) “for free”, 2) with minimal funding, 3) with extensive funding.
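
As a rough illustration of outcome 2 (overlaps and coverage gaps), the sketch below runs over a hand-made list of survey records. The record layout, feature names, and repository names are assumptions made for the example; they are not the re3data schema, and a real analysis would start from the corrected survey data described in outcome 1.

```python
from collections import Counter

# Hypothetical survey records (not re3data fields): each repository lists the
# domains it serves and the key features it reported.
survey = [
    {"name": "Repo A", "domains": {"hydrology"}, "features": {"doi", "open_deposit"}},
    {"name": "Repo B", "domains": {"hydrology", "ecology"}, "features": {"doi", "certified"}},
    {"name": "Repo C", "domains": {"oceanography"}, "features": {"open_deposit"}},
]

# Illustrative set of "key features" to check coverage against.
key_features = {"doi", "certified", "open_deposit"}


def coverage_gaps(records, features):
    """Map each domain to the key features that no repository serving it offers."""
    offered = {}
    for rec in records:
        for domain in rec["domains"]:
            offered.setdefault(domain, set()).update(rec["features"])
    return {d: sorted(features - f) for d, f in offered.items() if features - f}


def overlaps(records):
    """Return domains served by more than one repository, with the count."""
    counts = Counter(d for rec in records for d in rec["domains"])
    return {d: n for d, n in counts.items() if n > 1}


gaps = coverage_gaps(survey, key_features)
shared = overlaps(survey)
print("gaps:", gaps)
print("overlaps:", shared)
```

Here a domain counts as covered for a feature if any repository serving it offers that feature, so hydrology shows no gap even though no single repository has all three; whether that is the right definition of coverage is a choice the survey designers would need to make.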

capabilities/resources: leverage ESIP or RDA via its interest / working groups or data share fellowship (early career researchers doing the work?); dedicated and supported people are key

Notes:

How do you define this arena?
What is the scope of ‘environmental’ (designated community)?

  • Earth & Environmental researchers; social science data should be part of it

What categories of infrastructure should be included in the analysis?

  • Repositories per se, classified by domain of data covered (see re3data, COPDESS, etc.)
  • Features/strengths of repositories
  • Trusted/certified?
  • Disciplinary versus institutional or programmatic
  • International vs national
  • Cross-agency?
  • open contribution versus closed community
  • Archival mission?
  • Licensing policies
  • DOI/identifier support
  • FundRef assignment
  • Connections/collaboration between repositories
  • Approaches to cross-repository shared infrastructure/services/tools, e.g., shared user and authentication systems, shared APIs, shared identifier assignment services
  • Software tool support/matchmaking between people/tools/repositories

Three thoughts:

  • how well are existing repositories supporting their stakeholders?
  • how efficiently are data repositories operating internally?
  • how are data repositories interacting with each other and with repositories in other disciplines?

  • Task is to analyze the re3data lists to find the gaps in coverage for certain key features
  • How is the research community not served well by existing infrastructure/repos? especially across current providers; need to go back to researchers to determine if they are served
  • What existing survey data exists (NASA

Why is this a critical arena for action?

  • Different stakeholders have varying needs/perspectives, so need this landscape analysis to evaluate those different needs
  • Help make decisions - help repositories make decisions and connect with each other to serve their users better
  • Help users to decide what repositories are good options for their needs

What might the top 3-4 outcomes / results be within this area? And, in what time frame might these be realistically achieved?

  • Survey of capabilities of existing systems (start with re3data, fix/correct the data, add other needed systems)
  • Analysis of overlaps and gaps
  • Recommendation document for how to improve cross-repository collaboration and shared infrastructure

What are the high level actions that would need to be taken to pursue these outcomes/results?

  • Develop the survey using a small, targeted number of items / services
  • Analyze the data
  • Develop the recommendations by convening a coalition of willing/able repositories who want to invest in interoperability. Recommendations should be tiered on what can be done 1) “for free”, 2) with minimal funding, 3) with extensive funding.

What capabilities/resources (those in existence and yet to be developed) would be needed to engage in these actions?

  • Could be conducted within the context of ESIP or RDA via its interest / working groups or Data Share fellowship (early career researchers doing the work?)
  • Will need dedicated and supported people (see above)

Return on Investment of Data Repositories for Society

https://docs.google.com/document/d/10KKUjKnO3LbzCBr3Qu5_JhVBVaNiBn67nmqfRASDVeg/edit

URL: http://bit.ly/1HZa3ci for group

  • Anne Wilson
  • Ruth Duerr
  • Peter McCartney
  • Steve Daley-Laursen
  • Nancy Hoebelheinrich
  • Carolina Murcia
  • Erin Robinson
  • Dave Rugg

Slides:

Potential Arena for Action

  • ROI to society? Or improving that?
  • Just the economics of data centers
  • characterize the ROI environment
  • discovering what is good to measure in order to qualify and quantify the value
  • it’s about developing the method for determining ROI

To whom: people who fund curation, researchers, education, decision makers

—- Why critical?

  • no funding if value not seen or services not used, for sustainability reasons
  • funding is from many places: funding orgs, projects; many different decisions made all along the line, including voters; not all motivated by same criteria
  • to get additional users because currently not used enough
  • so managers know how to manage for value
  • an opportunity is lost if data are lost
  • loss of initial investment, capital

—— Top 3 - 4 outcomes?

  • Develop a common framework for establishing value of data repository infrastructure, services, operations (includes qualitative and quantitative assessment)
  • Make measurements
  • Reports on individual repos ROIs
  • Translate findings into messages meaningful to particular stakeholder communities
  • Characterize user base with metrics (e.g. number of papers originating from the data, download/usage statistics) including penetration into societal impacts, e.g., data results used in government reports
  • NASA likes data stories, e.g. ‘this image was on the news!’
  • able to compare apples to apples

——

What high level actions are needed to achieve outcomes?

  • Identify what the types of information are that would be useful and feasible to collect on an ongoing basis and determine how to measure them, based on value expressed by stakeholders, e.g., testimonials of use (the data story) and # of users
  • Ways to harmonize assessment results among data repositories
  • Determine quantitative and qualitative methods (see JISC report)
  • Identify high level, societal impacts of measuring for purpose of understanding and improving dc functionality

——

What capabilities, resources are needed?

  • Data repo manager buy-in so that they are willing to provide human resources from the repos
  • Codify (document) framework
  • Develop tools, standards to support
  • Educating the repos to actually do it
  • Identify ways for community to express support for data repositories to stakeholders including funders, overall managers, e.g., university provosts, etc.
  • What info needs to be known by what stakeholders in order for the information to be understood and applied?

—-

Notes:

then do measurements

client base: society, not just sci community to support decision making data users, providers

educational impacts, an impact on society value of data for tracking long term trends

includes analysis and vis tools ‘repositories’ include the services they provide

need to be thoughtful about political considerations: science? Or data, specifically? For now, we focus on data

Why critical????
no funding if value not seen or services not used, for sustainability reasons
funding is from many places: funding orgs, projects; many different decisions made all along the line, including voters; not all motivated by same criteria
to get additional users because currently not used enough
so managers know how to manage for value
an opportunity is lost if data are lost
Ruth: in general there is a U curve in use of data at NSIDC: usage drops then rises
loss of initial investment, capital

Common Technical Vision

https://docs.google.com/document/d/1CoAT_ZIZSzNPb0cZHC3bfkXbc8xO4ZLrShy_8Ow9CD0/edit#heading=h.k09oe6hl9osw


URL: http://bit.ly/1HH5Yz7

  • Cyndy Parr
  • Dave Vieglais
  • Ken Casey
  • Margaret Hedstrom
  • Greg Janee
  • Mark Servilla
  • Kevin Browne

Slides

How do you define this arena? Arena: Common Technical Vision

  • Goal: sustainability of data, not repositories, more robust links in the face of dependencies (e.g. controlled vocabularies), better adherence to standards
  • Unified framework - clear architecture and understanding of projects and their responsibilities and where they fit into the framework
  • Goals, standards, APIs and service endpoints, may or may not involve common file storage or structures, provenance, version control

—-

What might the top 3-4 outcomes / results be within this arena? In what time frame might these be realistically achieved?

Many of these exist but need a bit more refinement to be useful, so could be done within a couple of years:

Next 3-5 years:

  • Best practices for researchers on how to package their data for broad use without losing all the relationships in the package
  • Guidance for repository developers and data producers on how to consistently apply standards like PROV in this domain (and encourage development of validation tools to help with that)
  • Common set of use cases (and anti-use cases) with examples in our domain of how to apply abstract standards
  • Identification of appropriate standards based on similar recommendation efforts (US GEO, etc)

Longer term (these will take a few more years):

  • An implementation plan (based on above recommendations) sitting above the new OAIS ref model, bringing to it current technologies, emphasizing interoperability in our domain (searching across many packages, computing across a repository)
  • Establishing a technical review board to advise on plans for repositories

High level actions that would need to be taken to pursue these outcomes / results:

  • Convene the appropriate working group such as an ESIP cluster then working group
  • Write an NSF planning grant to bring more stakeholders to provide input to all of these things
  • Participate in DataONE outreach efforts

Capabilities / resources (those in existence and those yet to be developed) that would be needed to engage in these actions:

  • Travel funds
  • Commitments from organizations for employee time
  • Community sharing infrastructure (e.g. Google Docs or a wiki, teleconferencing)

Notes:

  • Scientists and science funding agencies pay for building as little of the infrastructure as possible
  • Some level of shared infrastructure
  • Interoperability of repositories
  • Sustainability - do we need more in common under the hood compared to just commonality at the level of interfaces
  • distributed infrastructure
  • Infrastructure needs to support heterogeneous needs and heterogeneous data

Why is this a critical arena for action?

  • Avoids inefficiencies of duplicated effort -- build on economies of scale
  • There's a lot of heterogeneity in solution implementations and some of that is for good reason, but there should be a framework that allows interoperability and collective goals to be met even if not all participants handle all parts of the life cycle
  • Make it easier to migrate to future platforms or across platforms

Entrain Stakeholders

  • Robert Downs
  • Rick Hooper
  • Paul Hanson
  • Wade Sheldon
  • Margaret O’Brien (scribe)

(interest group folded on 2015-19-11 due to lack of participant interest. Group members joined other interest groups)

URL: http://bit.ly/1OyEGvT

Slides

Arena: Entrain Stakeholders

There are a variety of stakeholders; their input is critical in development and operation of a data center

  • contributors
  • distributors (staff of the repo)
  • funders
  • orgs that employ the distributors
  • end users (target audience): superset of contributors. if it’s a very domain specific repo, it may be a closed loop. (contributors create data for each other)

—-

Why is this a Critical Arena

  • Critical part of the investment in a data center (ROI)
  • Stakeholders are the ones who should identify the direction for development, operation
  • Engender stakeholder commitment to the repo for sustainability

Outcomes

  • Process for end user input that is fast, frequent and iterative
  • Recognition that stakeholder groups change over time and engagement has to be ongoing
  • Stakeholder evaluation process that “has teeth” and can drive the direction the repository takes (e.g., evaluators write a letter to the funder, NASA)

———

Actions

  • Modification to RFPs to require a robust user engagement plan that delineates stakeholder groups and identifies senior personnel involved
  • Development process that is linked to the feedback of the stakeholders

Capabilities/Resources that are Needed

  • Flexible mechanisms to compensate stakeholders to support engagement
  • Facilitation or creation of partnerships to educate stakeholder groups about dev process

——

Notes:

Our group’s POVs
LTER/CUAHSI: mandate, plus an understanding that data need to be put out there.
NASA: direction from a user community, i.e., look at actual needs for users, then find data to meet them. these are “focus areas” (their internal shorthand: “Missions”).
Downs: his center is about human interactions in the environment.

Went to the users (eg, hydrologists), and asked them what they needed. They said “better access”. well, we’ve done that, and now we need to refine that. We have advisory committees, etc. but there is never adequate attention to outreach, so never enough attention given to how we get feedback from scientists. how to bring them along and even alter the way they do science (eg. virtual lab notebooks, etc)

define the arena: there are a variety of stakeholders; their input is critical in dev and operation of a repo

who are the stakeholders:

  • contributors
  • distributors (staff of the repo)
  • funders
  • orgs that employ the distributors
  • end users (target audience): superset of contributors. if it’s a very domain specific repo, it may be a closed loop. (contributors create data for each other)

contributors need: a repo that understands its data contributors better: their workflow, why they should do this. are repos meeting their needs?

why is this critical: parametrizes the ROI: this is part of the investment. stakeholders are the ones who should identify the direction for dev, operation, engender their commitment to the repo. comment: high-end stakeholders want really high functionality. we won’t get buy-in if we only shoot for the lowest common denominator. how do we balance? there are some who have found the sweet spot: exemplars. (http://www.iris.edu/hq/ for seismic instruments, a very focused type of data, that has high integration. theirs was a very specific use case.)

If repos are to add value, and there’s a misalignment between stakeholders/dev goals, repo will underperform.

top 3-4 outcomes

  • process for end user input that is fast, frequent and iterative: incremental improvements to the repo, for illustration, and get feedback. agile development includes more end-product mockups -- not just approval at the too-nerdy level.
  • recognition that stakeholder groups change over time.
  • stakeholder evaluation process that “has teeth” and can drive the direction the repository takes (e.g., evaluators write a letter to the funder, NASA)

comment: repos will have strategic planning


high level actions: modification to RFPs to require a robust user engagement plan that delineates stakeholder groups and identifies senior personnel who will enact this engagement. dev process will be linked to the feedback of the stakeholders. demonstrate effectiveness of your user engagement as for other aspects of your work.


what capabilities/resources are needed

  • flexible/creative mechanisms to compensate stakeholders. comment: creative means ways that are valuable to them -- usually it is not paying them directly. might be some other currency, e.g. contribute to postdoc fund, etc.
  • facilitation or a partnership to educate stakeholder groups about “dev process”