ESSI-LOD Python Framework

Latest revision as of 14:59, June 16, 2013, by Erozell89

Overview

In a prior implementation of the ESSI-LOD linked data, we used a Java code base to scrape AGU HTML pages for abstracts. After coordinating with AGU to convert directly from bulk text files of meeting abstracts, we rewrote the implementation in Python. This wiki page documents how to use (and extend) this Python framework.

Setting up the Framework

Dependencies

Python 2.x
Virtuoso 6.1.x

Sources

Google Code SVN Repository - http://linked-open-data-essi.googlecode.com/svn/
AGU Meeting Files (.txt) - /projects/abstracts/python

Usage

Convert AGU meetings into RDF

Move the .txt file(s) to be converted into the staging directory (e.g., /projects/abstracts/python/data/agu/staging)
Change the working directory to the Python ESSI-LOD project root (i.e., /projects/abstracts/python/essilod)
Run: ./meeting_parser.py --sparqlMatching /path/to/staging/directory [/path/to/output/directory/optional]
- If you need to regenerate RDF for keywords or sections, use the --keywords and --sections flags, respectively.
- Note that this process takes a while to run because it loads all the organization identifiers into main memory (for conflation purposes).
After the process has finished, check the syntax of the output .rdf files using the rapper utility: rapper -c /path/to/data.rdf

For clarity, here is a list of commands to run when FM13 data is released:

 cp ~/fm13.txt /projects/abstracts/python/data/agu/staging
 cd /projects/abstracts/python/essilod
 ./meeting_parser.py --sparqlMatching ../data/agu/staging
 cd ../data/agu/staging
 rapper -c -i turtle fm13.rdf
 rapper -c -i turtle people.rdf
 rapper -c -i turtle organizations.rdf
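
The syntax checks at the end of that listing can also be scripted. The following is a minimal sketch and not part of the framework: the helper names rapper_check_command and check_all are hypothetical, and it assumes that rapper is on the PATH, that the output files are Turtle (as in the commands above), and that rapper exits with a non-zero status when a file fails to parse.

```python
import os
import subprocess


def rapper_check_command(path):
    """Build the rapper call used above: count triples (-c), Turtle input."""
    return ["rapper", "-c", "-i", "turtle", path]


def check_all(directory, filenames):
    """Run rapper on each file and return the names that failed the check."""
    failed = []
    for name in filenames:
        command = rapper_check_command(os.path.join(directory, name))
        # Assumption: a non-zero exit status means the file did not parse.
        if subprocess.call(command) != 0:
            failed.append(name)
    return failed
```

For the FM13 example, check_all("../data/agu/staging", ["fm13.rdf", "people.rdf", "organizations.rdf"]) would run the same three checks and return the list of files that need attention.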

Load RDF into triple store (i.e., index)

Convert AGU Session info into RDF

Versioning for AGU

To reduce downtime of the AGU abstract browser, we have introduced graph versioning so that new data can be loaded while the old data remains operational.

The new convention for URIs is: http://abstracts.agu.org/graphs/{version}/{year}/{meeting_type_ID} (Note, the previous convention for URIs was: http://abstracts.agu.org/graphs/{year}/{meeting_type_ID})
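
To make the two conventions concrete, here is a small sketch that builds graph URIs from their parts. The function names graph_uri and legacy_graph_uri are hypothetical helpers for illustration and are not taken from the framework.

```python
GRAPH_BASE = "http://abstracts.agu.org/graphs"


def graph_uri(version, year, meeting_type_id):
    """Versioned convention: {base}/{version}/{year}/{meeting_type_ID}."""
    return "{0}/{1}/{2}/{3}".format(GRAPH_BASE, version, year, meeting_type_id)


def legacy_graph_uri(year, meeting_type_id):
    """Previous, unversioned convention: {base}/{year}/{meeting_type_ID}."""
    return "{0}/{1}/{2}".format(GRAPH_BASE, year, meeting_type_id)


print(graph_uri("1.0", 2012, "FM"))    # http://abstracts.agu.org/graphs/1.0/2012/FM
print(legacy_graph_uri(2012, "FM"))    # http://abstracts.agu.org/graphs/2012/FM
```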

We have introduced a new optional parameter in the load.py utility (see above section on loading the data) for version info. The usage is:

 python load.py /path/to/data/dir [version_ID] [base_URI] [/path/to/vload] [file_extension]

In a typical versioning scenario:

A user may have regenerated metadata from the meeting .txt files (including meetings, keywords, sections, people, and organizations)
They may want to reuse the static.ttl file being used by a previous version.
- Be sure to replace every instance of the old versioned URI with the new one, e.g.:

 http://abstracts.agu.org/graphs/1.0 => http://abstracts.agu.org/graphs/1.1

Load the .ttl files using the load.py script (setting the correct version ID)
Update the necessary files in LODSPeaKr to use the new version graphs:
- Change the static GRAPH reference and the graph metadata reference in /projects/abstracts/lodspeakr/components/includes/graphs.inc
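
The find/replace step for static.ttl in the scenario above can be sketched in Python as follows. This is an illustration under stated assumptions, not a framework utility: bump_graph_version is a hypothetical name, and the lookahead is an added safeguard so that replacing version 1.0 does not also rewrite a longer version string such as 1.0.1.

```python
import re

GRAPH_BASE = "http://abstracts.agu.org/graphs"


def bump_graph_version(text, old_version, new_version):
    """Replace the old versioned graph prefix with the new one.

    The negative lookahead stops "1.0" from matching inside a longer
    version such as "1.0.1" or "1.01".
    """
    old_prefix = "{0}/{1}".format(GRAPH_BASE, old_version)
    new_prefix = "{0}/{1}".format(GRAPH_BASE, new_version)
    pattern = re.escape(old_prefix) + r"(?![\w.])"
    # A function replacement avoids any backslash handling in new_prefix.
    return re.sub(pattern, lambda match: new_prefix, text)


sample = "<http://abstracts.agu.org/graphs/1.0/2012/FM>"
print(bump_graph_version(sample, "1.0", "1.1"))
```

In practice the text would be read from the copied static.ttl and written back before loading; keeping a backup of the file first is a sensible precaution.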

Manual fixes for meeting data

Sometimes the easiest way to make a fix is to edit the Turtle RDF (.ttl) files directly...

Make the necessary edits (the .ttl files are usually in /projects/abstracts/python/data/rdf/)
Delete the current graph for that file:
- /projects/abstracts/virtuoso-scripts/vdelete http://abstracts.agu.org/graphs/{version}/{year}/{meeting_type_ID}
- --OR-- /projects/abstracts/virtuoso-scripts/vdelete http://abstracts.agu.org/graphs/{version}/{filename} for non-meeting files (e.g., keywords.ttl)
Add the edited .ttl file:
- /projects/abstracts/virtuoso-scripts/vload ttl {/path/to/file.ttl} http://abstracts.agu.org/graphs/{version}/{year}/{meeting_type_ID}
- --OR-- /projects/abstracts/virtuoso-scripts/vload ttl {/path/to/file.ttl} http://abstracts.agu.org/graphs/{version}/{filename} for non-meeting files (e.g., keywords.ttl)

Here are a few examples to clarify. Say we needed to fix a typo in the title of an abstract for FM12:

 Edit /projects/abstracts/python/data/rdf/fm12.ttl, find and replace the incorrect title
 /projects/abstracts/virtuoso-scripts/vdelete http://abstracts.agu.org/graphs/1.0/2012/FM
 /projects/abstracts/virtuoso-scripts/vload ttl fm12.ttl http://abstracts.agu.org/graphs/1.0/2012/FM

Or, if we needed to fix a typo in the name of an AGU index term (i.e., keyword):

 Edit /projects/abstracts/python/data/rdf/keywords.ttl, find and replace the incorrect name
 /projects/abstracts/virtuoso-scripts/vdelete http://abstracts.agu.org/graphs/1.0/keywords
 /projects/abstracts/virtuoso-scripts/vload ttl keywords.ttl http://abstracts.agu.org/graphs/1.0/keywords
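
Because every manual fix follows the same delete-then-load pattern, the two commands can be generated from the file path and graph URI. The sketch below only builds the command lines shown above; reload_commands is a hypothetical helper, and nothing here is executed against Virtuoso.

```python
VIRTUOSO_SCRIPTS = "/projects/abstracts/virtuoso-scripts"


def reload_commands(ttl_path, graph):
    """Return the vdelete and vload invocations for one edited .ttl file."""
    return [
        [VIRTUOSO_SCRIPTS + "/vdelete", graph],
        [VIRTUOSO_SCRIPTS + "/vload", "ttl", ttl_path, graph],
    ]


commands = reload_commands(
    "/projects/abstracts/python/data/rdf/keywords.ttl",
    "http://abstracts.agu.org/graphs/1.0/keywords")
for command in commands:
    print(" ".join(command))
```

Each inner list could be passed to subprocess.check_call to run the step; the order matters, since the old graph must be deleted before the edited file is loaded.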

Other utilities

Entity Classes

The Entity classes are for objects converted directly into RDF. They are typically instantiated either by a Stream class or by another Entity class.

Meeting

Stores metadata about AGU meetings, such as year, type identifier (e.g., FM), and the set of associated AGU meeting sessions.

Session

Stores metadata about AGU meeting sessions, such as identifier, name, conveners, sponsoring section, and set of associated AGU meeting abstracts.

Abstract

Stores information about AGU meeting abstracts, such as identifier, title, authors, and keywords, as well as the abstract itself.

Author

Stores information about AGU meeting abstract authors, such as name, email and affiliation.

Convener

Stores information about AGU session conveners, such as name and affiliation.

Keyword

Stores information about AGU keywords, such as description and identifier.

Stream Classes

The Stream classes are utility classes for consuming bulk files containing entities. They are typically instantiated directly from a main program.

MeetingStream

Parses AGU .txt files for meeting abstracts
Iterates over chunks of the document delimited by HTML comment lines
First attempts to process each chunk as an Abstract; if the Abstract extractor (HtmlAbstract) throws an exception, processes the chunk as a Session instead
Uses the HtmlAbstract and HtmlSession subclasses to extract data
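
The chunking and fallback behaviour described for MeetingStream can be sketched as follows. This is not the framework's code: iter_chunks and parse_chunk are hypothetical names, the extractors are passed in as plain callables standing in for HtmlAbstract and HtmlSession, and the delimiter test (any line starting with "<!--") is an assumption about what an HTML comment line looks like in the AGU files.

```python
def iter_chunks(lines):
    """Yield lists of lines, splitting wherever an HTML comment line occurs."""
    chunk = []
    for line in lines:
        # Assumption: a delimiter is any line that starts with "<!--".
        if line.lstrip().startswith("<!--"):
            if chunk:
                yield chunk
            chunk = []
        else:
            chunk.append(line)
    if chunk:
        yield chunk


def parse_chunk(chunk, parse_abstract, parse_session):
    """Try the abstract extractor first; fall back to the session extractor."""
    try:
        return parse_abstract(chunk)
    except Exception:
        return parse_session(chunk)
```

Trying the abstract extractor first and treating any exception as "this is a session" keeps the loop simple, at the cost of hiding genuine abstract parsing errors; logging the exception before falling back would make such cases easier to spot.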

SessionInfoStream

Parses AGU .dat files for meeting sessions
Iterates over each line, extracting a Session instance for each
Uses SessionSummary subclass to extract data
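
The line-per-session iteration can be sketched in the same spirit. The layout of the .dat files is not documented here, so the sketch takes the line parser as a parameter rather than guessing a format; iter_sessions is a hypothetical name, skipping blank lines is an assumption, and the tab-separated sample data is invented purely to show the call.

```python
def iter_sessions(lines, parse_line):
    """Yield one parsed session per non-empty line."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Assumption: blank lines carry no session and are skipped.
        if line:
            yield parse_line(line)


# Made-up two-column sample; the real .dat layout may differ entirely.
sample = ["S1\tFirst session\n", "\n", "S2\tSecond session\n"]
for session in iter_sessions(sample, lambda line: tuple(line.split("\t"))):
    print(session)
```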

Lookup Classes

The Lookup classes are used for entity conflation, i.e., to consolidate URIs when the evidence supports it.
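
As an illustration of what conflation means here, the sketch below maps mentions that normalize to the same name onto a single URI. It is a toy model only: NameLookup is a hypothetical class, the normalization rule (case and whitespace) is an assumption, the example.org URIs are placeholders, and the framework's actual matching evidence may be quite different.

```python
class NameLookup(object):
    """Map normalized names to a single URI (illustration only)."""

    def __init__(self):
        self._uris = {}

    @staticmethod
    def normalize(name):
        # Assumption: names differing only in case or spacing are one entity.
        return " ".join(name.lower().split())

    def resolve(self, name, candidate_uri):
        """Return the URI already known for this name, else register the candidate."""
        return self._uris.setdefault(self.normalize(name), candidate_uri)


lookup = NameLookup()
first = lookup.resolve("Example University", "http://example.org/org/1")
second = lookup.resolve("  example   UNIVERSITY ", "http://example.org/org/2")
print(first == second)  # True: both mentions share the first URI
```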

OrganizationLookup

PersonLookup