ESSI-LOD Python Framework

From Earth Science Information Partners (ESIP)

Revision as of 20:15, June 6, 2013

Overview

In a prior implementation of the ESSI-LOD linked data, we used a Java code base to scrape AGU HTML pages for abstracts. Having coordinated with AGU to convert directly from bulk text files of meeting abstracts, we rewrote the implementation in Python. This wiki page documents how to use (and extend) this Python framework.

Setting up the Framework

Dependencies

  • Python (version?)
  • Virtuoso (version?)

Sources

  • (Google Code?)
  • AGU Meeting Files (.txt)

Usage

Entity Classes

The Entity classes are for objects converted directly into RDF. They are typically instantiated either by a Stream class or by another Entity class.
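
As a rough illustration of what "converted directly into RDF" can mean, the sketch below serializes one entity to a single N-Triples line. It is not the framework's code: the class name AbstractEntity, its fields, the example.org URI, and the choice of the Dublin Core title predicate are all assumptions made for the example.

```python
# Hypothetical sketch: AbstractEntity and its fields are illustrative,
# not the framework's actual Entity classes.
DCTERMS_TITLE = "http://purl.org/dc/terms/title"


class AbstractEntity:
    """A minimal entity holding a URI and a title."""

    def __init__(self, uri, title):
        self.uri = uri
        self.title = title

    def to_ntriples(self):
        """Return one N-Triples statement giving the entity's title."""
        # Escape backslashes first, then quotes and newlines.
        literal = (
            self.title.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        return '<%s> <%s> "%s" .' % (self.uri, DCTERMS_TITLE, literal)
```

A Stream class, or another entity, could create such an object and write the returned line to an output file for loading into a triple store.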

Meeting

Session

Abstract

Section

Author

Convener

Keyword

Stream Classes

The Stream classes are utility classes for consuming bulk files containing entities. They are typically instantiated directly from a main program.

MeetingStream

  • Parses AGU .txt files for meeting abstracts
  • Iterates over chunks of the document delimited by HTML comment lines (i.e., <!-- ... -->)
  • First attempts to process each chunk as an Abstract; falls back to Session if the Abstract extractor (HtmlAbstract) throws an exception
  • Uses HtmlAbstract and HtmlSession subclasses to extract data
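
The chunking and fallback behaviour listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the framework's code: parse_abstract and parse_session are hypothetical stand-ins for HtmlAbstract and HtmlSession, and the "ABSTRACT" line prefix is an invented placeholder, since the real AGU file layout is not described here.

```python
import re

# A line consisting solely of an HTML comment marks a chunk boundary.
COMMENT_LINE = re.compile(r"^\s*<!--.*-->\s*$")


def iter_chunks(lines):
    """Yield lists of non-blank lines found between HTML comment lines."""
    chunk = []
    for line in lines:
        if COMMENT_LINE.match(line):
            if chunk:
                yield chunk
            chunk = []
        elif line.strip():
            chunk.append(line.rstrip("\n"))
    if chunk:
        yield chunk


def parse_abstract(chunk):
    """Hypothetical stand-in for HtmlAbstract; raises if the chunk does not fit."""
    if not chunk[0].startswith("ABSTRACT"):  # placeholder test, not the AGU format
        raise ValueError("chunk is not an abstract")
    return ("abstract", chunk[0])


def parse_session(chunk):
    """Hypothetical stand-in for HtmlSession."""
    return ("session", chunk[0])


def parse_chunks(lines):
    """Try each chunk as an abstract first, then fall back to a session."""
    results = []
    for chunk in iter_chunks(lines):
        try:
            results.append(parse_abstract(chunk))
        except ValueError:
            results.append(parse_session(chunk))
    return results
```

Because iter_chunks accepts any iterable of lines, an open file object could be passed to parse_chunks directly, so a large bulk file need not be read into memory at once.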

SessionInfoStream

  • Parses AGU .dat files for meeting sessions
  • Iterates over each line, extracting a Session instance for each
  • Uses SessionSummary subclass to extract data
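
The one-session-per-line pattern above might look like the sketch below. The column layout of the AGU .dat files is not documented on this page, so the tab delimiter and the two fields (session_id, title) are assumptions, and SessionRecord is a hypothetical stand-in for the Session entity produced via SessionSummary.

```python
from dataclasses import dataclass


@dataclass
class SessionRecord:
    """Hypothetical stand-in for a Session entity; the fields are assumed."""
    session_id: str
    title: str


def iter_sessions(lines, delimiter="\t"):
    """Yield one SessionRecord per non-blank line of delimited text."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(delimiter)
        title = fields[1].strip() if len(fields) > 1 else ""
        yield SessionRecord(session_id=fields[0].strip(), title=title)
```

As a generator, iter_sessions produces records lazily, which suits the Stream classes' role of consuming bulk files line by line.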

Lookup Classes

The Lookup classes are used for entity conflation, i.e., to consolidate URIs when the evidence supports it.
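
One simple form of conflation is to map name variants to a shared key, so that each variant resolves to the same URI. The sketch below illustrates that idea only. The class NameLookup, its normalization rule, and the example.org base URI are assumptions, and the framework's Lookup classes may rely on entirely different evidence.

```python
import re


class NameLookup:
    """Hypothetical lookup that assigns one URI per normalized name."""

    def __init__(self, base_uri):
        self.base_uri = base_uri
        self._uris = {}  # normalized name -> URI

    @staticmethod
    def _key(name):
        """Lower-case the name and collapse punctuation and whitespace."""
        return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

    def uri_for(self, name):
        """Return the URI for a name, minting one on first sight."""
        key = self._key(name)
        if key not in self._uris:
            self._uris[key] = self.base_uri + key.replace(" ", "-")
        return self._uris[key]
```

With this rule, "Rensselaer Polytechnic Institute" and "rensselaer  polytechnic institute." share one URI. String normalization alone can merge distinct entities or miss true matches, so a real lookup would likely need stronger evidence than the name.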

OrganizationLookup