Difference between revisions of "ESSI-LOD Python Framework"
Revision as of 14:36, June 16, 2013
Overview
In a prior implementation of the ESSI-LOD linked data, we used a Java code base to scrape AGU HTML pages for abstracts. Having coordinated with AGU to convert directly from bulk text files of meeting abstracts, we rewrote the implementation in Python. This wiki page documents how to use (and extend) this Python framework.
Setting up the Framework
Dependencies
- Python 2.x
- Virtuoso 6.1.x
Sources
- Google Code SVN Repository - http://linked-open-data-essi.googlecode.com/svn/
- AGU Meeting Files (.txt) - /projects/abstracts/python
Usage
Convert AGU meetings into RDF
- Move .txt file(s) that must be converted into staging directory (e.g., /projects/abstracts/python/data/agu/staging)
- Change working directory to Python ESSI-LOD project root (i.e., /projects/abstracts/python/essilod)
- Run: ./meeting_parser.py --sparqlMatching /path/to/staging/directory [/path/to/output/directory/optional]
- If you need to regenerate RDF for keywords or sections, use the --keywords and --sections flags, respectively
- Note that this process takes a little while to run because it loads all the organization identifiers into main memory (for conflation purposes)
- After the process has finished running, check the output .rdf files for syntax using the rapper utility: rapper -c /path/to/data.rdf
- For clarity, here is a list of commands to run when FM13 data is released:
    cp ~/fm13.txt /projects/abstracts/python/data/agu/staging
    cd /projects/abstracts/python/essilod
    ./meeting_parser.py --sparqlMatching ../data/agu/staging
    cd ../data/agu/staging
    rapper -c -i turtle fm13.rdf
    rapper -c -i turtle people.rdf
    rapper -c -i turtle organizations.rdf
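The same sequence can be scripted. The following is only a sketch, not part of the framework: build_steps and run_steps are hypothetical helper names, the staging path is written out in absolute form rather than as ../data/agu/staging, and it assumes (as the FM13 example suggests) that the parser writes its .rdf output into the staging directory.

```python
import os
import subprocess

# Paths taken from the instructions above.
STAGING = "/projects/abstracts/python/data/agu/staging"
PROJECT_ROOT = "/projects/abstracts/python/essilod"


def build_steps(meeting_file, staging=STAGING, project_root=PROJECT_ROOT):
    """Return (working_directory, argv) pairs mirroring the manual commands."""
    # "fm13.txt" -> "fm13", used to name the meeting's output file.
    stem = os.path.splitext(os.path.basename(meeting_file))[0]
    steps = [
        (None, ["cp", meeting_file, staging]),
        (project_root, ["./meeting_parser.py", "--sparqlMatching", staging]),
    ]
    # Syntax-check the meeting output plus the people and organizations files.
    for name in (stem, "people", "organizations"):
        steps.append((staging, ["rapper", "-c", "-i", "turtle", name + ".rdf"]))
    return steps


def run_steps(steps):
    """Run each step in order, stopping at the first non-zero exit status."""
    for cwd, argv in steps:
        subprocess.check_call(argv, cwd=cwd)
```

For example, run_steps(build_steps(os.path.expanduser("~/fm13.txt"))) would execute the seven commands listed above in order; calling build_steps alone lets you review the commands before running anything.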
Load RDF into triple store (i.e., index)
Convert AGU Session info into RDF
Versioning for AGU
Manual fixes for meeting data
- Sometimes the easiest way to make a fix is to edit the Turtle RDF (.ttl) files directly...
- Make the necessary edits (the .ttl files are usually in /projects/abstracts/python/data/rdf/)
- Delete the current graph for that file:
- /projects/abstracts/virtuoso-scripts/vdelete http://abstracts.agu.org/graphs/{version}/{year}/{meeting type ID}
- --OR-- /projects/abstracts/virtuoso-scripts/vdelete http://abstracts.agu.org/graphs/{version}/{people|organizations|sections|keywords|static}
- Add the edited .ttl file:
- /projects/abstracts/virtuoso-scripts/vload ttl {/path/to/file.ttl} http://abstracts.agu.org/graphs/{version}/{year}/{meeting type ID}
- --OR-- /projects/abstracts/virtuoso-scripts/vload ttl {/path/to/file.ttl} http://abstracts.agu.org/graphs/{version}/{people|organizations|sections|keywords|static}
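Because the delete and load steps must name exactly the same graph, it can help to build the graph URI once and reuse it. The sketch below is illustrative only: the function names are hypothetical, and the version string "v1" used in the usage note is a placeholder, since the actual version values are not given on this page.

```python
GRAPH_BASE = "http://abstracts.agu.org/graphs"
SCRIPTS = "/projects/abstracts/virtuoso-scripts"
# Graphs that are not tied to a single meeting, per the --OR-- forms above.
SHARED_GRAPHS = ("people", "organizations", "sections", "keywords", "static")


def meeting_graph(version, year, meeting_type):
    """Graph URI for one meeting, e.g. year 2013 and meeting type ID FM."""
    return "%s/%s/%s/%s" % (GRAPH_BASE, version, year, meeting_type)


def shared_graph(version, name):
    """Graph URI for one of the shared graphs (people, keywords, ...)."""
    if name not in SHARED_GRAPHS:
        raise ValueError("unknown shared graph: %r" % (name,))
    return "%s/%s/%s" % (GRAPH_BASE, version, name)


def reload_commands(ttl_path, graph):
    """Return the vdelete and vload command lines for one edited .ttl file."""
    return [
        [SCRIPTS + "/vdelete", graph],
        [SCRIPTS + "/vload", "ttl", ttl_path, graph],
    ]
```

For instance, reload_commands("/path/to/file.ttl", meeting_graph("v1", 2013, "FM")) yields the two command lines from the list above with the same graph URI in both, which avoids deleting one graph and loading into another.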
Other utilities
Entity Classes
The Entity classes are for objects converted directly into RDF. They are typically instantiated either by a Stream class or by another Entity class.
Meeting
Stores metadata about AGU meetings, such as year, type identifier (e.g., FM), and the set of associated AGU meeting sessions.
Session
Stores metadata about AGU meeting sessions, such as identifier, name, conveners, sponsoring session, and set of associated AGU meeting abstracts.
Abstract
Stores information about AGU meeting abstracts, such as identifier, title, authors, and keywords, as well as the abstract itself.
Author
Stores information about AGU meeting abstract authors, such as name, email and affiliation.
Convener
Stores information about AGU session conveners, such as name and affiliation.
Keyword
Stores information about AGU keywords, such as description and identifier.
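To make the relationships between these classes concrete, here is a rough sketch of how they nest. It is not the framework's actual code: the attribute names and constructor signatures are assumptions chosen to match the descriptions above, and the RDF serialization logic is omitted entirely.

```python
class Keyword(object):
    def __init__(self, identifier, description):
        self.identifier = identifier
        self.description = description


class Author(object):
    def __init__(self, name, email=None, affiliation=None):
        self.name = name
        self.email = email
        self.affiliation = affiliation


class Convener(object):
    def __init__(self, name, affiliation=None):
        self.name = name
        self.affiliation = affiliation


class Abstract(object):
    def __init__(self, identifier, title, text, authors=None, keywords=None):
        self.identifier = identifier
        self.title = title
        self.text = text                # the abstract itself
        self.authors = authors or []    # list of Author
        self.keywords = keywords or []  # list of Keyword


class Session(object):
    def __init__(self, identifier, name, conveners=None, abstracts=None):
        self.identifier = identifier
        self.name = name
        self.conveners = conveners or []  # list of Convener
        self.abstracts = abstracts or []  # list of Abstract


class Meeting(object):
    def __init__(self, year, type_id, sessions=None):
        self.year = year
        self.type_id = type_id          # e.g., "FM"
        self.sessions = sessions or []  # list of Session
```

In this sketch a Meeting holds Sessions, each Session holds Abstracts and Conveners, and each Abstract holds Authors and Keywords, which matches the "instantiated by another Entity class" pattern described above.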
Stream Classes
The Stream classes are utility classes for consuming bulk files containing entities. They are typically instantiated directly from a main program.
MeetingStream
- Parses AGU .txt files for meeting abstracts
- Iterates over chunks of the document delimited by HTML comment lines
- First attempts to process each chunk as an Abstract; falls back to Session if the Abstract extractor (HtmlAbstract) throws an exception
- Uses HtmlAbstract and HtmlSession subclasses to extract data
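The chunk-and-fallback behavior can be sketched as follows. This is an illustration rather than the real MeetingStream: it assumes a delimiter is any line consisting solely of an HTML comment (the exact comment text is not given on this page), and parse_abstract and parse_session are hypothetical stand-ins for the HtmlAbstract and HtmlSession extractors.

```python
import re

# Assumption: a delimiter is a line that contains only an HTML comment.
COMMENT_LINE = re.compile(r"^\s*<!--.*-->\s*$")


def iter_chunks(lines):
    """Yield the text between consecutive HTML comment lines."""
    chunk = []
    for line in lines:
        if COMMENT_LINE.match(line):
            text = "".join(chunk)
            if text.strip():
                yield text
            chunk = []
        else:
            chunk.append(line)
    text = "".join(chunk)
    if text.strip():
        yield text


def parse_chunks(lines, parse_abstract, parse_session):
    """Try each chunk as an abstract first, then fall back to a session."""
    for chunk in iter_chunks(lines):
        try:
            result = ("abstract", parse_abstract(chunk))
        except Exception:
            # The abstract extractor rejected the chunk; treat it as a session.
            result = ("session", parse_session(chunk))
        yield result
```

Since lines can be any iterable of strings, an open file object can be passed directly, so a large .txt file is processed one chunk at a time rather than being read into memory whole.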
SessionInfoStream
- Parses AGU .dat files for meeting sessions
- Iterates over each line, extracting a Session instance for each
- Uses SessionSummary subclass to extract data
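The line-oriented case is simpler. In the sketch below, parse_line is a hypothetical stand-in for the SessionSummary extractor, because the field layout of the .dat files is not documented here; skipping blank lines is likewise an assumption.

```python
def iter_sessions(lines, parse_line):
    """Yield one parsed session per non-blank input line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.strip():  # assumption: blank lines carry no session
            yield parse_line(line)
```

As with the meeting files, passing an open file object as lines keeps memory use constant regardless of file size.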
Lookup Classes
The Lookup classes are used for entity conflation, i.e., to consolidate URIs when the evidence supports it.
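As a minimal illustration of conflation, the sketch below maps names that normalize to the same key onto a single URI. It is not the framework's implementation: the NameLookup class, its normalization rule, the URI numbering scheme, and the example.org base URI are all assumptions, and the real Lookup classes may rely on other evidence (for example, matching against the triple store, as the --sparqlMatching flag suggests).

```python
import re


class NameLookup(object):
    """Hypothetical in-memory lookup from a normalized name to a URI."""

    def __init__(self, base_uri):
        self.base_uri = base_uri
        self._uris = {}

    @staticmethod
    def normalize(name):
        # Lowercase and collapse punctuation/whitespace runs to one space.
        return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

    def uri_for(self, name):
        """Return the known URI for this name, minting one if it is new."""
        key = self.normalize(name)
        if key not in self._uris:
            self._uris[key] = "%s/%d" % (self.base_uri, len(self._uris) + 1)
        return self._uris[key]
```

With lookup = NameLookup("http://example.org/organization"), the calls lookup.uri_for("Rensselaer Polytechnic Institute") and lookup.uri_for("rensselaer  polytechnic institute.") return the same URI, while a different name receives a new one.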