Difference between revisions of "Linked Open Research Data for Earth and Space Science Informatics"

From Earth Science Information Partners (ESIP)

Revision as of 13:38, August 5, 2011

Linked Open Research Data for Earth and Space Science Informatics

Tom Narock and Eric Rozell

A 2011 ESIP Funding Friday Project

Abstract: Earth and Space Science Informatics (ESSI) is inherently multi-disciplinary, requiring close collaborations between scientists and information technologists. Identifying potential collaborations can be difficult, especially with the rapidly changing landscape of technologies and informatics projects. The ability to discover the technical competencies of other researchers in the community can help in the discovery of collaborations. In addition to collaboration discovery, social network information can be used to analyze trends in the field, which will help project managers identify irrelevant, well-established, and emerging technologies and specifications. This information will help keep projects focused on the technologies and standards that are actually being used, making them more useful to the ESSI community.

We address this problem with a solution involving two components: a pipeline for generating structured data from AGU-ESSI abstracts and ESIP member information, and an API and Web application for accessing the generated data. We use a Natural Language Processing technique, Named Entity Disambiguation, to extract information about researchers, their affiliations, and the technologies they have applied in their research. We encode the extracted data in the Resource Description Framework (RDF), using Linked Data vocabularies including the Semantic Web for Research Communities (SWRC) ontology and the Friend of a Friend (FOAF) ontology. Lastly, we expose this data in three ways: through a SPARQL endpoint, through Java and PHP APIs, and through a Web application. Our implementations are open source, and we expect that the pipeline and APIs can evolve with the community.

Useful Tools

DBpedia Spotlight:

  • text - the text you want annotated
  • confidence - the confidence threshold a term must meet to be annotated
  • support - the minimum number of inlinks a Wikipedia page must have for annotation
  • Example Service Call
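
As a rough illustration of how the three parameters above fit into a service call, the sketch below builds a GET URL for a Spotlight annotate request. The endpoint address, the default values, and the function name build_spotlight_url are assumptions made for this example, not settings taken from the project; substitute the address of whichever Spotlight instance you use.

```python
from urllib.parse import urlencode

# Assumption: address of a Spotlight annotate service. Replace it with the
# instance you actually run or have access to.
SPOTLIGHT_ANNOTATE_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_spotlight_url(text, confidence=0.5, support=20,
                        base_url=SPOTLIGHT_ANNOTATE_URL):
    """Return a GET URL carrying the text, confidence and support parameters.

    The defaults (0.5 and 20) are arbitrary illustrative values.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if support < 0:
        raise ValueError("support must be non-negative")
    query = urlencode({"text": text, "confidence": confidence, "support": support})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # Print the URL only; no network request is made here.
    print(build_spotlight_url("Linked Data for Earth science",
                              confidence=0.4, support=10))
```

Raising confidence or support should yield fewer but more reliable annotations, so it may be worth trying several settings on a handful of abstracts before fixing them for the whole corpus.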

OpenCalais

Workflow Documentation

Step 1: Scrape abstracts from AGU

  • <insert script details here>

Step 2: Convert scraped data to RDF

  • <insert script details here>
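
Until the script details are filled in, the following is a minimal sketch of what the conversion could look like, assuming each scraped abstract arrives as a dictionary with id, title, text, and authors fields. The record layout, the http://example.org/essi/ namespace, the URI scheme, and the particular SWRC and FOAF terms are all assumptions for illustration and should be checked against the ontologies and the real scraper output. It emits N-Triples lines using only the standard library; affiliations could follow the same pattern.

```python
SWRC = "http://swrc.ontoware.org/ontology#"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/essi/"  # hypothetical namespace for minted URIs


def literal(value):
    """Serialize a string as an N-Triples plain literal with basic escaping."""
    escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def triple(subject, predicate, obj):
    """Format one N-Triples statement; obj is already serialized."""
    return f"<{subject}> <{predicate}> {obj} ."


def abstract_to_ntriples(record):
    """Convert one scraped abstract record (hypothetical layout) to N-Triples."""
    abstract_uri = BASE + "abstract/" + record["id"]
    lines = [
        triple(abstract_uri, RDF_TYPE, f"<{SWRC}Publication>"),
        triple(abstract_uri, SWRC + "title", literal(record["title"])),
        triple(abstract_uri, SWRC + "abstract", literal(record["text"])),
    ]
    for author in record["authors"]:
        person_uri = BASE + "person/" + author["id"]
        lines.append(triple(person_uri, RDF_TYPE, f"<{FOAF}Person>"))
        lines.append(triple(person_uri, FOAF + "name", literal(author["name"])))
        lines.append(triple(abstract_uri, SWRC + "author", f"<{person_uri}>"))
    return lines


if __name__ == "__main__":
    demo = {  # made-up record, not real AGU data
        "id": "demo-001",
        "title": "A demo abstract",
        "text": "Placeholder abstract text.",
        "authors": [{"id": "p1", "name": "Jane Doe"}],
    }
    print("\n".join(abstract_to_ntriples(demo)))
```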

Step 3: Annotate converted data with named entities in abstracts

  • Determine appropriate Spotlight settings (e.g., confidence and support)
  • Option 1: Use default annotation service and surface form dictionary
    • For each abstract
      • Feed abstract to Spotlight service
      • Add triples for disambiguated entities
      • Add triples for provenance (e.g., confidence and support)
  • Option 2: Use disambiguation service and use Microsoft Web N-Grams Service for surface forms
    • For each abstract
      • Use Web N-Grams Service and surface forms dictionaries to identify surface forms
      • Update abstracts with surface forms
      • Add triples for provenance (e.g., N-Gram probabilities)
      • Feed abstracts with surface forms to disambiguation service
      • Add triples for disambiguated entities
      • Add triples for provenance (e.g., confidence and support)
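
The per-abstract loop in Option 1 can be sketched as follows, assuming the Spotlight service returns JSON with a top-level "@confidence" value and a "Resources" list whose entries carry "@URI", "@surfaceForm", and "@support" keys. The sample response below is hand-written to resemble that layout and its values are made up; the http://example.org/essi/ predicates are hypothetical placeholders, since the provenance vocabulary has not been chosen yet.

```python
import json

ESSI = "http://example.org/essi/"  # hypothetical vocabulary namespace

# Hand-written sample resembling a Spotlight JSON response (values are made up).
SAMPLE_RESPONSE = json.loads("""
{
  "@confidence": "0.4",
  "@support": "10",
  "Resources": [
    {"@URI": "http://dbpedia.org/resource/SPARQL",
     "@surfaceForm": "SPARQL",
     "@support": "120"}
  ]
}
""")


def annotation_triples(abstract_uri, response):
    """Turn one annotation response into (subject, predicate, object) tuples.

    Each disambiguated entity becomes an annotation node that records the
    entity URI plus provenance: the surface form matched, the confidence
    threshold used for the request, and the entity's support.
    """
    triples = []
    for n, resource in enumerate(response.get("Resources", [])):
        annotation = f"{abstract_uri}/annotation/{n}"
        triples.append((abstract_uri, ESSI + "hasAnnotation", annotation))
        triples.append((annotation, ESSI + "entity", resource["@URI"]))
        triples.append((annotation, ESSI + "surfaceForm", resource["@surfaceForm"]))
        triples.append((annotation, ESSI + "confidence", float(response["@confidence"])))
        triples.append((annotation, ESSI + "support", int(resource["@support"])))
    return triples


if __name__ == "__main__":
    for t in annotation_triples(ESSI + "abstract/demo-001", SAMPLE_RESPONSE):
        print(t)
```

Keeping provenance on a separate annotation node, rather than linking the abstract directly to the entity, makes it possible to filter annotations later by confidence or support without re-running the service.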

Step 4: Load annotated data in a persistent triple store

  • Candidates: Virtuoso, TDB, OWLIM, AllegroGraph
  • <insert triple store details here>
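
Each candidate store has its own bulk loader, and the choice has not been made yet. As one store-neutral possibility, the sketch below wraps N-Triples lines in a SPARQL Update INSERT DATA request and prepares (but does not send) the HTTP POST. It assumes the chosen store exposes a SPARQL 1.1 Update endpoint; the endpoint and graph URIs are placeholders.

```python
import urllib.parse
import urllib.request


def build_insert_data(ntriples_lines, graph_uri):
    """Wrap N-Triples statements in a SPARQL Update INSERT DATA operation."""
    body = "\n".join("    " + line for line in ntriples_lines)
    return f"INSERT DATA {{\n  GRAPH <{graph_uri}> {{\n{body}\n  }}\n}}"


def make_update_request(endpoint_url, update):
    """Prepare a POST that sends the update as a URL-encoded form parameter."""
    data = urllib.parse.urlencode({"update": update}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=data,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    lines = ['<http://example.org/essi/abstract/demo-001> '
             '<http://swrc.ontoware.org/ontology#title> "A demo abstract" .']
    update = build_insert_data(lines, "http://example.org/essi/graph")
    # Placeholder endpoint; pass the request to urllib.request.urlopen to send it.
    request = make_update_request("http://localhost:3030/essi/update", update)
    print(update)
```

For a corpus of this size a native bulk loader is likely faster, so this approach is better suited to incremental updates after the initial load.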

Evaluation

Other Ideas

  • ESSI keywords were introduced in Fall 2009. Maybe we should set up a web form to allow authors to go back and annotate older abstracts with ESSI keywords.
  • Chris Lynnes suggested measuring ESIP's impact - need to think of how to do this using our data.
  • The ESIP member list is available from Erin/Carol. Could it be encoded as FOAF?
  • Peter mentioned uncovering hidden (non-explicit) networks.