Interoperability and Technology/Tech Dive Webinar Series
Past Tech Dive Webinars (2015-2022)
9 March 2023: "Meeting Data Where it Lives: the power of virtual access patterns"
Mike Johnson (Lynker, NOAA-affiliate) will rant and rave about the VRT and VSI (curl and S3) virtual data access patterns and how he's used them to work with LCMAP and 3DEP data in integrated climate and data analysis workflows.
Recording:
Minutes:
- VRT stands for "ViRTual"
- VSI stands for "Virtual System Interface"
- Framed by FAIR
LCMAP – requires fairly complex URLs to access specific data elements.
3DEP – requires understanding the tiling scheme to access data across domains.
Note some large packages (zip files) where only one small file is actually desired.
NWM datasets in NetCDF files that change name (with time step) daily as they are archived.
Implications for Findability, Accessibility, and Reuse – note that interoperability is actually pretty good once you have the data.
VRT: – an XML "metadata" wrapper around one or more tif files.
Use case 1: download all of 3DEP tiles and wrap in a VRT xml file.
- VRT has an overall aggregated grid "shape"
- Includes references to all the individual files.
- Can access the dataset through the vrt wrapper to work across all the tiles.
- Creates a seamless collection of subdatasets
- Major improvement to accessibility.
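As a rough illustration of use case 1, the sketch below writes a minimal VRT by hand using only Python's standard library. In practice the `gdalbuildvrt` utility generates this file from a list of tiles. The tile names, the 100 x 100 tile size, and the side-by-side layout are hypothetical, and the georeferencing elements (SRS, GeoTransform) are left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_vrt(tiles, tile_size=100, data_type="Float32"):
    """Return VRT XML that mosaics equally sized tiles from left to right.

    `tiles` is a list of file paths; names and sizes are illustrative only.
    """
    root = ET.Element(
        "VRTDataset",
        rasterXSize=str(len(tiles) * tile_size),  # overall aggregated grid shape
        rasterYSize=str(tile_size),
    )
    band = ET.SubElement(root, "VRTRasterBand", dataType=data_type, band="1")
    for i, path in enumerate(tiles):
        # One SimpleSource per tile: a reference to the file, not a copy of it.
        src = ET.SubElement(band, "SimpleSource")
        name = ET.SubElement(src, "SourceFilename", relativeToVRT="1")
        name.text = path
        ET.SubElement(src, "SourceBand").text = "1"
        rect = dict(xOff="0", yOff="0", xSize=str(tile_size), ySize=str(tile_size))
        ET.SubElement(src, "SrcRect", rect)
        # Shift each tile to its column in the mosaic.
        ET.SubElement(src, "DstRect", dict(rect, xOff=str(i * tile_size)))
    return ET.tostring(root, encoding="unicode")


vrt_xml = build_vrt(["tile_a.tif", "tile_b.tif"])
print(vrt_xml)
```

Saving the returned string as a `.vrt` file next to the tiles should let GDAL-based tools treat the two tiles as one raster.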
If you have to download the data, is that really "reuse" of the data?
VSI: – allows virtualization of data from remote resources available over a few protocols (S3, HTTP, compressed archives)
Wide variety of GDAL utilities to access VSI files – zip, tar, 7zip
Use case 2: Access a tif file remotely without downloading all the data in the file.
- Uses vsi to access a single tif file
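A minimal sketch of the path convention behind use case 2: a VSI path is an ordinary location with a handler prefix in front. The URL and bucket below are hypothetical placeholders, and the GDAL call is left as a comment so the snippet runs without GDAL installed.

```python
def vsicurl(url):
    """Prefix an http(s) URL so GDAL reads it through /vsicurl/."""
    return "/vsicurl/" + url


def vsis3(bucket, key):
    """Build a /vsis3/ path for an object in an S3 bucket."""
    return f"/vsis3/{bucket}/{key.lstrip('/')}"


# Hypothetical locations, for illustration only.
remote_tif = vsicurl("https://example.com/dem/tile_a.tif")
s3_tif = vsis3("example-bucket", "dem/tile_a.tif")
print(remote_tif)
print(s3_tif)

# With the GDAL Python bindings installed, the path opens like a local file:
#   from osgeo import gdal
#   ds = gdal.Open(remote_tif)
#   window = ds.GetRasterBand(1).ReadAsArray(0, 0, 256, 256)  # 256 x 256 window
```

For a tiled GeoTIFF, reading a small window this way should fetch only the byte ranges it needs rather than the whole file.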
Use case 3: Use vsi within a vrt to remotely access contents of remote tif files.
- Note that the vrt file doesn't actually have to be local itself.
- If the tiles that the vrt points to update, the vrt will update by default.
- Can easily access and reuse data without actually copying it around.
Use case 4: OGR using vsi to access a shapefile in a tar.gz file remotely.
- Can create a nested url pattern to access contents of the tar.gz remotely.
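The nested pattern in use case 4 can be sketched as plain string composition: the archive handler goes in front of the `/vsicurl/` path, and the member name follows the archive name. The archive URL and shapefile name are hypothetical, and the suffix-to-handler table is an assumption covering only zip and tar archives.

```python
# Archive suffix -> GDAL archive handler (assumed subset, for illustration).
ARCHIVE_HANDLERS = {
    ".zip": "/vsizip/",
    ".tar.gz": "/vsitar/",
    ".tgz": "/vsitar/",
    ".tar": "/vsitar/",
}


def remote_archive_member(url, member):
    """Return a chained VSI path to `member` inside a remote archive at `url`."""
    for suffix, handler in ARCHIVE_HANDLERS.items():
        if url.endswith(suffix):
            # e.g. /vsitar/ + /vsicurl/<url> + /<member>
            return f"{handler}/vsicurl/{url}/{member}"
    raise ValueError(f"no archive handler known for {url}")


path = remote_archive_member(
    "https://example.com/data/boundaries.tar.gz",  # hypothetical archive
    "boundaries.shp",                              # hypothetical member
)
print(path)

# With GDAL/OGR installed, the layer could then be opened without unpacking:
#   from osgeo import ogr
#   ds = ogr.Open(path)
```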
Use case 5: NWM short-range forecast of streamflow in a NetCDF file.
- Prepending "HDF5:" to the front of a vsicurl url allows access to a netcdf file directly.
- The access url pattern is SUPER tricky to get right.
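To show why the pattern is easy to get wrong, here is a sketch that assembles one such path. It assumes the GDAL HDF5 subdataset form `HDF5:"<file>"://<dataset>`; the base URL and the file naming convention are assumptions about how NWM short-range output is published, so check them against the actual host before relying on them.

```python
from datetime import date

# Assumed base URL for operational NWM output; verify before use.
NWM_BASE = "https://nomads.ncep.noaa.gov/pub/data/nccf/com/nwm/prod"


def nwm_short_range_path(day, cycle_hour, lead_hour,
                         variable="streamflow", base=NWM_BASE):
    """Build an HDF5-over-vsicurl path for one NWM short-range channel file.

    The file name pattern below is an assumption for illustration.
    """
    fname = (f"nwm.t{cycle_hour:02d}z.short_range."
             f"channel_rt.f{lead_hour:03d}.conus.nc")
    url = f"{base}/nwm.{day:%Y%m%d}/short_range/{fname}"
    # Driver prefix, quoted /vsicurl/ path, then the dataset inside the file.
    return f'HDF5:"/vsicurl/{url}"://{variable}'


nwm_path = nwm_short_range_path(date(2023, 3, 9), cycle_hour=0, lead_hour=1)
print(nwm_path)
```

The date, cycle hour, and forecast lead time all appear in the file name, which is why the names change daily as files are archived.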
Use case 6: "flat catalogs"
- Stores a flat (denormalized) table of data variables with the information required to construct URLs.
- Can search based on rudimentary metadata within the catalog.
- Can access and reuse data from any host in the same workflow.
Use case 7: access NWM current and archived data from a variety of cloud data stores.
- Leveraging the flat catalog content to fix up urls and data access nuances.
Flat catalog improves findability down at the level of individual data variables.
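A toy version of the flat catalog idea, assuming one row per data variable with just enough metadata to search on and a template for building the access path. The column names, hosts, and templates are all hypothetical and are not the schema of the real catalog.

```python
# One row per variable; every value here is a made-up example.
CATALOG = [
    {"id": "dem", "variable": "elevation", "units": "m",
     "template": "/vsicurl/https://example.com/dem/{tile}.tif"},
    {"id": "nwm", "variable": "streamflow", "units": "m3 s-1",
     "template": 'HDF5:"/vsicurl/https://example.com/nwm/{date}.nc"://streamflow'},
    {"id": "landcover", "variable": "primary_landcover", "units": "class",
     "template": "/vsis3/example-bucket/lcmap/{year}.tif"},
]


def search(catalog, **criteria):
    """Return rows whose columns equal every given criterion."""
    return [row for row in catalog
            if all(row.get(key) == value for key, value in criteria.items())]


def resolve(row, **params):
    """Fill the row's template to get a path GDAL could open."""
    return row["template"].format(**params)


hits = search(CATALOG, variable="streamflow")
streamflow_path = resolve(hits[0], date="20230309")
print(streamflow_path)
```

Because each row carries its own template, data from different hosts and protocols can be resolved in the same workflow with the same two functions.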
Take Aways / discussion:
Question about the flat catalog:
"Minimal set of shortcuts" to get at this fast access mechanism.
Is the flat catalog manually curated?
More or less – all are automated but some custom logic is required to add additional content.
Would be great to systematize creation of this flat catalog more broadly.
Question: Could some “examples” be posted either in this doc or elsewhere (or links to examples), for a beginner to copy/paste some code and see for themselves, begin to think about how we’d use this? Something super basic please.
GDAL documentation is good but doesn't have many examples.
climateR has a workflow that shows how the catalog was built.
What about authentication issues?
- S3 is handled at a session level.
- Earthengine can be handled similarly.
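One possible reading of "session level" for S3, sketched with the standard library only: GDAL configuration options such as `AWS_NO_SIGN_REQUEST` can be supplied as environment variables, so they can be set once for a block of work. The `gdal_env` helper name is hypothetical.

```python
import os
from contextlib import contextmanager


@contextmanager
def gdal_env(**options):
    """Temporarily set GDAL config options through environment variables."""
    saved = {key: os.environ.get(key) for key in options}
    os.environ.update(options)
    try:
        yield
    finally:
        # Restore whatever was there before the block ran.
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old


# Request unsigned access, as is typical for public buckets.
with gdal_env(AWS_NO_SIGN_REQUEST="YES"):
    inside = os.environ["AWS_NO_SIGN_REQUEST"]
    # GDAL reads of /vsis3/ paths would go here.
print(inside)
```

For private buckets the same pattern would carry credentials such as `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` instead.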
How much word of mouth or human-to-human interaction is required for the catalog?
- If there is a stable entrypoint (S3 bucket for example) some automation is possible.
- If entrypoints change, configuration needs to be changed based on human intervention.
9 Feb 2023: "February 2023 - Rants & Raves"
The conversation built on the "rants and raves" session from the 2023 January ESIP Meeting, starting with very short presentations and an in-depth discussion on interoperability and the Committee's next steps.
Recording:
Minutes:
- Mike Mahoney: Make Reproducibility Easy
- Dave Blodgett: FAIR data and Science Data Gateways
- Doug Fils: Web architecture and Semantic Web
- Megan Carter: Opening Doors for Collaboration
- Yuhan (Douglas) Rao: Where are we for AI-ready data?
I had a couple major take aways from the Winter Meeting:
- We have come a long way in IT interoperability but most of our tools are based on tried and true fundamentals. We should all know more about those fundamentals.
- There are a TON of unique entry points to things that, at the end of the day, do more or less the same thing. These are opportunities to work together and share tools.
- The “shiny object” is a great way to build enthusiasm and trigger ideas and we need to better capture that enthusiasm and grow some shared knowledge base.
So with that, I want to suggest three core activities:
- We seek out presentations that explore foundational aspects of interoperability. I want to help build an awareness of the basics that we all kind of know but either take for granted, haven’t learned yet, or straight up forgot.
- We ask for speakers to explore how a given solution fits into multiple domains' information systems and to discuss the tension between the diversity of use cases that are accommodated by an IT solution targeted at interoperability. We are especially interested to learn about the expense / risk of adopting dependencies vs the efficiency that can be gained from adopting pre-built dependencies.
- We look for opportunities to take small but meaningful steps to record the core aspects of these sessions in the form of web resources like the ESIP wiki or even Wikipedia. On this front, we will aim to construct a summary wiki page from each meeting, assembled from a working notes document and the presenting authors' contributions.