2022-02-17 Winter meeting recap and planning

Usage Based Discovery Cluster February 2022 Meeting

February 17, 2022

Agenda:

Recap winter meeting and takeaways
Plan the next six months of activity on the Discovery Cluster

People who attended:

Simon Goring
Douglas Rao
Lindsay Barbieri
Chris Lynnes
Tyler Christensen
Vinny Inverso
Jonathan Blythe
Megan Carter

Notes:

Opportunities to Work Together on usage based discovery?

Cross-Agency goal

Objectives:

messaging usage based discovery.
information sharing and learning together.
analytics, content, ML, and UI.

Breakout Session Takeaways: Social asymmetries in usage based discovery

Group 7, UBD Data Architecture: Discipline boundaries are a source of fragmentation. Investments in a discipline may not extend across the Earth Sciences. UBD Cluster should address fundamental interoperability and governance issues, and build this in anticipation of disparate investments.
Group 1, Messaging UBD: Identified an audience for UBD data and tools: early career scientists and interdisciplinary researchers. The UBD use case is one where people need to get up to speed on dataset use within a discipline on a relatively compressed timeline.
Group 6, Machine Learning: The Coleridge Initiative involved just three agencies: USDA/ERD, NSF/Economics Division, and NOAA. NASA would like to get involved. Need to test algorithms for use in Earth Science disciplines - there are concerns about whether the code can be generalized to other agencies' publications and datasets.
Group 5, Topics and findings: Engineering the relevance ranking of search results; be careful that UBD doesn't reinforce dataset popularity, because this group was interested in leveling the playing field among different datasets. Envisioned success criteria where UBD helps users identify the most useful dataset.

Discussion:

Vinny Inverso: NASA Heliophysics Division is interested in UBD. How do we connect knowledge graphs?

Megan Carter: ESIP Knowledge Graph Cluster: Wednesday at 4PM.

Jonathan Blythe: Is the Knowledge Graph Cluster addressing discovery?

Chris Lynnes: Traversing unbounded knowledge graphs?

Douglas Rao: Knowledge graphs are often developed for specific use cases.

Jonathan Blythe: Based on this discussion, it appears that the Knowledge Graph Cluster has a different focus from our cluster, so it shouldn't be a reason for concern, but let's stay informed on their progress.

Simon Goring:

NASA and EPA data management activities are well funded and competent - what about other agencies that have less?

Could scrape all the DataCite DOIs.

Doug Fils - Science on Schema.org is trying to link to datasets across organizations - could bring this into our knowledge graph.

Tyler Christensen:

Irina Gerasimov asked for NASA Earth Obs to get involved in the Coleridge Initiative.
The next step for the Coleridge Initiative is to crowdsource verification of the output from machine learning algorithms.
Statistical agencies focus.

Chris Lynnes: Reach out to Bob Downs, who works at SEDAC.

Jonathan Blythe: Bob Downs was cited for attending the most ESIP meetings in his award at the last Winter ESIP Meeting, so we can forgive him for not attending the cluster meeting today. But if he were here, I'm sure he'd have some insight on the boundary between statistical data and Earth science data, given his role at SEDAC.

Tyler Christensen: Talking to NOAA's chief economist about valuing Earth science data. What is the economic impact of NOAA data?

Jonathan Blythe: US-IOOS (which is part of NOAA) periodically does this kind of economic analysis of observing systems. I'm pretty sure that every major observing system at NOAA and NASA already has an economic evaluation. That is just part of the package for justifying a billion+ dollar investment to Congress. What's really needed is economic valuation of the long-tail datasets: the one-offs and small program funded data collections.

Simon Goring:

He's pretty sure that NSF is funding data management for only 10% of the data produced by NSF researchers.
Neotoma is a good example of the long tail: it is a data service provided to a community of paleoecology researchers, specialized in preserving the Neotoma paleo record.
He could see using Usage Based Discovery to help justify Neotoma data services to the NSF, in terms of measuring the benefit of data publication and reuse to the advancement of paleoecology.

Tyler Christensen: Another angle for Usage Based Discovery - the NOAA Chief Librarian is involved in the OSTP Committee on Open Science. NOAA Library has been providing an institutional repository for NOAA publications. However, thus far the participation rate is lower than expected. Would it make sense to present data citation to the NOAA Librarian as a possible augmentation of NOAA's Public Access to Research Results requirements? When NOAA researchers make their publications available via this institutional repository, shouldn't they also provide DOI links to the datasets that they used in the research?

Jonathan Blythe: Great, let's make this a 6 month deliverable for the UBD Cluster - some sort of presentation to the OSTP Committee. Can we meet with the NOAA librarian in three months or so to start making progress?

Tyler Christensen: Yes, will inquire and get back on next steps.

Jonathan Blythe: What about a 6 month deliverable for the UBD cluster in the area of Machine Learning?

Simon Goring: Could named entity recognition within publications (i.e., DOI scraping from PDF documents) be a Machine Learning application for UBD?
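
A minimal sketch of the DOI scraping idea, assuming the PDF text has already been extracted to a plain string by some other tool. It uses a regular expression rather than a trained named entity recognition model. The pattern, the function name extract_dois, and the sample DOIs are illustrative assumptions, not something the cluster has adopted.

```python
import re

# Assumed heuristic pattern: "10.", a 4-9 digit registrant code, "/", then a
# suffix that runs until whitespace, a quote, or an angle bracket.
# Real DOIs can contain characters this pattern handles poorly.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def extract_dois(text):
    """Return unique DOIs found in text, in order of first appearance."""
    seen = set()
    dois = []
    for match in DOI_PATTERN.finditer(text):
        # Trailing punctuation usually belongs to the sentence, not the DOI.
        doi = match.group(0).rstrip(".,;:)]}'")
        key = doi.lower()  # DOIs are case-insensitive, so compare lowercased
        if key not in seen:
            seen.add(key)
            dois.append(doi)
    return dois


# Made-up sample text with made-up DOIs, for illustration only.
sample = (
    "Data are available at https://doi.org/10.1234/example.dataset.v2. "
    "See also (doi:10.5678/ABC-123) and 10.1234/EXAMPLE.dataset.v2, again."
)
print(extract_dois(sample))
```

A pattern match only finds strings that look like DOIs. Deciding whether a DOI points to a dataset or to another publication would need a lookup against a registry, which is outside this sketch.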

Chris Lynnes: Another angle for Machine Learning is classification of documents for visualization in the UBD tool. We learned it wasn't enough to list all the publications in the UBD tool, because there were too many. We use the ML classifications to provide an intermediate level of UBD navigation between the science topics and the list of research articles.

Jonathan Blythe: Okay, so we have some ideas for the ML focus of the UBD Cluster, and it's an area of interest to cluster participants, but perhaps we need to discuss this a bit further before we settle on a 6 month deliverable. Let's pick up this topic again next month. See you in March!