Difference between revisions of "Linked Open Research Data for Earth and Space Science Informatics"

From Earth Science Information Partners (ESIP)

Revision as of 11:55, August 12, 2011

Linked Open Research Data for Earth and Space Science Informatics

Tom Narock and Eric Rozell

A 2011 ESIP Funding Friday Project

Abstract: Earth and Space Science Informatics (ESSI) is inherently multi-disciplinary, requiring close collaborations between scientists and information technologists. Identifying potential collaborations can be difficult, especially with the rapidly changing landscape of technologies and informatics projects. The ability to discover the technical competencies of other researchers in the community can help in the discovery of collaborations. In addition to collaboration discovery, social network information can be used to analyze trends in the field, which will help project managers identify irrelevant, well-established, and emerging technologies and specifications. This information will help keep projects focused on the technologies and standards that are actually being used, making them more useful to the ESSI community.

We address this problem with a solution involving two components: a pipeline for generating structured data from AGU-ESSI abstracts and ESIP member information, and an API and Web application for accessing the generated data. We use a Natural Language Processing technique, Named Entity Disambiguation, to extract information about researchers, their affiliations, and technologies they have applied in their research. We encode the extracted data in the Resource Description Framework, using Linked Data vocabularies including the Semantic Web for Research Communities ontology and the Friend-of-a-Friend ontology. Lastly, we expose this data in three ways: through a SPARQL endpoint, through Java and PHP APIs, and through a Web application. Our implementations are open source, and we expect that the pipeline and APIs can evolve with the community.  

Useful Tools

DBpedia Spotlight:

  • text - the text you want annotated
  • confidence - a threshold for terms that are annotated
  • support - the minimum number of inlinks a Wikipedia page must have for annotation
  • Example Service Call
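
A minimal sketch of how the three parameters above might be combined into a service call. Only the parameter names (text, confidence, support) come from the list above; the endpoint URL and the default values are assumptions and should be replaced with the Spotlight service actually in use.

```python
from urllib.parse import urlencode

# Assumed endpoint (hypothetical local deployment); substitute the real service URL.
SPOTLIGHT_ANNOTATE_URL = "http://localhost:2222/rest/annotate"


def build_annotate_url(text, confidence=0.4, support=20,
                       base_url=SPOTLIGHT_ANNOTATE_URL):
    """Return a GET URL carrying the text, confidence and support parameters."""
    query = urlencode({
        "text": text,              # the text you want annotated
        "confidence": confidence,  # threshold for terms that are annotated
        "support": support,        # minimum number of Wikipedia inlinks
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    print(build_annotate_url("Linked Data for Earth and Space Science Informatics"))
```

Raising confidence or support should yield fewer annotations, so it may be worth trying several settings on a handful of abstracts before running the full pipeline.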

OpenCalais

Workflow Documentation

Step 1: Scrape abstracts from AGU

  • <insert script details here>

Step 2: Convert scraped data to RDF

  • <insert script details here>

Step 3: Annotate converted data with named entities in abstracts

  • Determine appropriate Spotlight settings (e.g., confidence and support)
  • Option 1: Use default annotation service and surface form dictionary
    • For each abstract
      • Feed abstract to Spotlight service
      • Add triples for disambiguated entities
      • Add triples for provenance (e.g., confidence and support)
  • Option 2: Use disambiguation service and use Microsoft Web N-Grams Service for surface forms
    • For each abstract
      • Use Web N-Grams Service and surface forms dictionaries to identify surface forms
      • Update abstracts with surface forms
      • Add triples for provenance (e.g., N-Gram probabilities)
      • Feed abstracts with surface forms to disambiguation service
      • Add triples for disambiguated entities
      • Add triples for provenance (e.g., confidence and support)
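
The Option 1 loop above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's script: the `http://example.org/essi/` namespace, the property names (`hasAnnotation`, `entity`, `confidence`, `support`) and the `fake_annotate` stand-in for the Spotlight service are all hypothetical. Triples are written as plain N-Triples strings so that the sketch needs no RDF library.

```python
EX = "http://example.org/essi/"  # hypothetical namespace for illustration only


def literal(value):
    """Serialize a value as a plain N-Triples string literal."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def annotate_abstract(abstract_uri, text, annotate, confidence, support):
    """Return N-Triples lines for the entities found in one abstract.

    `annotate` is any callable (text, confidence, support) -> list of entity
    URIs; in a real pipeline it would wrap the annotation service.
    """
    triples = []
    for i, entity_uri in enumerate(annotate(text, confidence, support)):
        annotation = f"{abstract_uri}/annotation/{i}"
        # Triples for the disambiguated entity
        triples.append(f"<{abstract_uri}> <{EX}hasAnnotation> <{annotation}> .")
        triples.append(f"<{annotation}> <{EX}entity> <{entity_uri}> .")
        # Triples for provenance: the settings used for this annotation
        triples.append(f"<{annotation}> <{EX}confidence> {literal(confidence)} .")
        triples.append(f"<{annotation}> <{EX}support> {literal(support)} .")
    return triples


def fake_annotate(text, confidence, support):
    """Stand-in for the annotation service: a tiny surface form lookup."""
    known = {"RDF": "http://dbpedia.org/resource/Resource_Description_Framework"}
    return [uri for form, uri in known.items() if form in text]


if __name__ == "__main__":
    abstracts = {EX + "abstract/1": "We encode the extracted data in RDF."}
    for uri, text in abstracts.items():
        for line in annotate_abstract(uri, text, fake_annotate, 0.4, 20):
            print(line)
```

Attaching the settings to each annotation, rather than to the abstract, keeps the provenance usable if the pipeline is later rerun with different confidence or support values.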

Step 4: Load annotated data in a persistent triple store

  • Candidates: Virtuoso, TDB, OWLIM, AllegroGraph
  • <insert triple store details here>

Evaluation

Open Issues

Use of roles in ontology and RDF data

  • Summary: The SWRC ontology treats classes like Employee and Student as subclasses of Person. This is an inaccurate representation of reality, as Employee (or Student) is a role played by a particular Person. See the use case that follows.
  • Use Case: Eric Rozell was an employee of Microsoft Research. In one of his publications, his affiliations included both Microsoft Research and RPI. In a later publication, after leaving Microsoft Research, Eric's affiliation should only be RPI.
  • Problem: Once a person is affiliated with an organization at any point in time, that affiliation applies to the person in every context, including every publication. For listing publications with accurate affiliations, we want only the specific affiliations listed for that publication, not all affiliations for the person.
  • Solutions:
  1. Use the Tetherless World Constellation ontology, where affiliations are attached to specific roles played by people.
  2. Create a unique instance for each combination of affiliations needed for a specific person.
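
To make the first solution concrete, here is a small sketch of the role-based idea using plain Python data classes. It does not reproduce the TWC ontology; the class and field names, and the publication titles, are hypothetical. The point is only that a publication refers to roles (person plus organization) rather than to the person directly.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Role:
    """A role played by a person at an organization (hypothetical model)."""
    person: str
    organization: str


@dataclass
class Publication:
    title: str
    # The publication lists roles, so affiliations are scoped per publication.
    author_roles: list = field(default_factory=list)


def affiliations(publication, person):
    """Return only the affiliations this person listed on this publication."""
    return [r.organization for r in publication.author_roles if r.person == person]


earlier = Publication("Earlier publication", [
    Role("Eric Rozell", "Microsoft Research"),
    Role("Eric Rozell", "RPI"),
])
later = Publication("Later publication", [Role("Eric Rozell", "RPI")])

if __name__ == "__main__":
    print(affiliations(earlier, "Eric Rozell"))
    print(affiliations(later, "Eric Rozell"))
```

With this shape, the use case above works without creating a separate person instance for each combination of affiliations.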

Tom: Option 2 doesn't seem ideal and is probably not scalable. Option 1 is fine with me, but would this ontology break the mobile app, which currently runs on SWRC? Also, is the TWC ontology available online?

Unique identification of organizations

  • Summary: We would like to be able to identify organizations that show up in multiple publications.
  • Use Case: Eric would like to find all abstracts at AGU written by people in affiliation with Woods Hole Oceanographic Institution.
  • Problem: AGU affiliation data is unstructured. We would need to separate the research group, the department, the organization, and the address using some heuristics.
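
One possible heuristic, sketched under the assumption that an affiliation string is a comma-separated list running from the most specific unit to the address. The keyword lists are illustrative guesses, not derived from the AGU data, and real affiliations will include cases this gets wrong (for example, an institution nested inside a university).

```python
# Illustrative keyword lists (assumptions); tune against real affiliation strings.
GROUP_KEYWORDS = ("department", "dept", "group", "division", "school of")
ORG_KEYWORDS = ("university", "institute", "institution", "laboratory",
                "center", "centre")


def split_affiliation(raw):
    """Split a comma-separated affiliation into units, organization, address.

    Parts before the first organization-like part are treated as research
    groups or departments; parts after it are treated as the address.
    """
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    result = {"units": [], "organization": None, "address": []}
    for part in parts:
        lowered = part.lower()
        if result["organization"] is not None:
            result["address"].append(part)
        elif any(k in lowered for k in GROUP_KEYWORDS):
            result["units"].append(part)
        elif any(k in lowered for k in ORG_KEYWORDS):
            result["organization"] = part
        else:
            result["units"].append(part)
    return result


if __name__ == "__main__":
    print(split_affiliation(
        "Woods Hole Oceanographic Institution, Woods Hole, MA, USA"))
```

The extracted organization string could then serve as the key for coining a single identifier per organization, which is what the use case above needs.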

Representation of AGU Sections (e.g., ESSI)

  • Summary: What sort of thing is an AGU Section with respect to the SWRC ontology?
  • Problem: We need to determine whether AGU Sections have a corresponding class in SWRC or whether we should create a new class.
  • Solutions:
  1. Find a SWRC class that corresponds to AGU Sections.
  2. Create a new class for AGU (and potentially other meeting) sections.

Tom: This depends on whether we use SWRC or switch to the TWC ontology as mentioned above. How would AGU Sections fit into the TWC ontology?

Crawler is crawling links which do not lead to abstracts

  • Problem: The crawler currently throws exceptions when it tries to crawl links that do not lead to abstracts.
  • Solutions:
  1. Try to detect bad links before sending them to the abstract parser.
  2. Create a new class of Exception when the abstract parser determines the HTML is not for an abstract.
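
The second solution might look like the sketch below. The exception name, the stand-in parser and its "contains the word abstract" check are all hypothetical; the actual crawler and parser are not shown on this page.

```python
class NotAnAbstractError(Exception):
    """Raised when the parser is handed HTML that is not an abstract page."""


def parse_abstract(html):
    """Stand-in parser: the marker check below is a placeholder assumption."""
    if "abstract" not in html.lower():
        raise NotAnAbstractError("page does not look like an abstract")
    return {"html": html}


def crawl(pages):
    """Parse every page, skipping non-abstract pages instead of failing.

    `pages` maps URL -> already fetched HTML. Only NotAnAbstractError is
    caught, so genuine parser bugs still surface as ordinary exceptions.
    """
    parsed, skipped = [], []
    for url, html in pages.items():
        try:
            parsed.append((url, parse_abstract(html)))
        except NotAnAbstractError:
            skipped.append(url)
    return parsed, skipped


if __name__ == "__main__":
    pages = {
        "http://example.org/a1": "<h1>Abstract</h1><p>Some text</p>",
        "http://example.org/index": "<h1>Session list</h1>",
    }
    parsed, skipped = crawl(pages)
    print(len(parsed), "parsed;", "skipped:", skipped)
```

Keeping the list of skipped URLs also provides examples of bad links, which would help if the first solution (detecting them before parsing) is attempted later.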

Keeping unique IDs consistent across iterations of the pipeline

  • Summary: Coining unique IDs for the first time is easy, but we need to enable the reuse of those IDs when future data is added.
  • Use Case: I've coined a unique ID for the person, Eric Rozell,
  • Problem: How do we ensure that the IDs we coin now are reused when future data is added via the pipeline?
  • Solutions:
  1. Do not worry about reusing IDs when future data is run through the pipeline, instead perform post-analytics to determine "same as" relationships. Still, we need to perform collision detection for the URIs that are coined.
  2. Load all the past data before running the pipeline on the new data and perform identification as usual.
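
The second option, together with the collision detection mentioned in the first, can be sketched as a small registry that is seeded with previously coined IDs before a new run. The base URI, the use of an email address as the distinguishing key, and the class name are assumptions made for illustration.

```python
import re

BASE = "http://example.org/essi/person/"  # hypothetical base URI


def slugify(name):
    """Turn a display name into a URI-friendly slug."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


class IdRegistry:
    """Maps an identity key to a coined URI, reusing past assignments."""

    def __init__(self, known=None):
        # `known` holds (name, email) -> URI pairs loaded from earlier runs.
        self.uris = dict(known or {})

    def uri_for(self, name, email):
        key = (name.strip().lower(), email.strip().lower())
        if key in self.uris:
            return self.uris[key]  # reuse the ID coined earlier
        slug = slugify(name)
        taken = set(self.uris.values())
        candidate, n = BASE + slug, 1
        while candidate in taken:  # collision: same slug, different identity
            n += 1
            candidate = f"{BASE}{slug}-{n}"
        self.uris[key] = candidate
        return candidate


if __name__ == "__main__":
    first_run = IdRegistry()
    uri = first_run.uri_for("Eric Rozell", "eric@example.org")
    # A later run starts from the saved mapping, so the same URI comes back.
    second_run = IdRegistry(known=first_run.uris)
    print(uri == second_run.uri_for("Eric Rozell", "eric@example.org"))
```

This only works when the identifying key is stable across runs; when it is not (for example, a changed email address), the post-hoc "same as" analysis from the first option would still be needed.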

Other Ideas

  • ESSI keywords were introduced in Fall 2009. Maybe we should set up a web form to allow authors to go back and annotate older abstracts with ESSI keywords.
  • Chris Lynnes suggested measuring ESIP's impact - need to think of how to do this using our data.
  • ESIP members list is available from Erin/Carol - FOAF?
  • Peter mentioned uncovering hidden/non-explicit networks.