RDF, Inference Engines, and the Web

Since starting in the College of Liberal Arts in November 2007 as the new lead developer for digital humanities, I’ve been putting together some design ideas and initial code towards a Digital Resources Workbench.

Most digital humanities projects that I’ve seen catalog a lot of information and provide a search interface for exploring it. That information could be editions of a particular book, images on a topic, or perhaps more granular information about concepts, places, people, events, etc.: a mix of artifact and annotation.

This seems to break down into a fairly simple set of responsibilities: entry, storage, retrieval, interpretation, and presentation. If I can define simple, open, standard APIs to interface between each stage, then each stage can be fairly independent and reusable by itself. Artifacts and annotations have slightly different needs, so we need to break those apart into an artifact collection and a knowledge base. There might be some special cases, but I’m trying to push as much of the “special handling” as possible into the back end.
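
To make that concrete, here is a rough sketch of what a couple of those stage boundaries might look like in Python. The class and method names are placeholders I’m using for illustration, not a settled API.

```python
# Illustrative only: possible interfaces between the stages described above.
# The names here are placeholders, not the Workbench's actual API.
from typing import Iterable, Protocol


class Storage(Protocol):
    def put(self, resource_id: str, document: bytes, metadata: dict) -> None:
        """Store a document and its management metadata under an identifier."""

    def get(self, resource_id: str) -> tuple[bytes, dict]:
        """Return the document and its metadata."""


class Retrieval(Protocol):
    def find(self, **criteria: str) -> Iterable[str]:
        """Return the identifiers of resources matching the given criteria."""


class Presentation(Protocol):
    def render(self, resource_id: str, format: str = "html") -> str:
        """Produce a human- or machine-readable view of a resource."""
```

As long as each stage only talks to its neighbors through interfaces like these, any one of them can be swapped out or reused on its own.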

Artifacts have a central document (image, text, video, etc.) that can be considered a canonical source. Some meta-data will be associated with it, such as the size of the document, the type, the collection it belongs to, perhaps some ACLs and a URL for public access. This meta-data is the minimal set needed to manage the artifact itself. Any information such as who created it, where it was published, etc., would be in an associated knowledge base.
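
As a sketch, that management metadata could be as small as a record like this; the field names are assumptions, not a fixed schema.

```python
# Illustrative only: the minimal metadata needed to manage an artifact itself.
# Everything else (creator, publication history, ...) lives in the knowledge base.
from dataclasses import dataclass, field


@dataclass
class Artifact:
    canonical_source: str            # location of the central document
    media_type: str                  # e.g. "image/tiff", "text/xml", "video/mp4"
    size_bytes: int
    collection: str                  # the collection the artifact belongs to
    acls: list[str] = field(default_factory=list)
    public_url: str | None = None    # URL for public access, if any
```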

Knowledge bases are collections of facts. These can be written as RDF triples (subject, predicate, object), with the subject being the resource described, the predicate indicating an aspect of the subject, and the object being the value associated with that aspect. If we want to add a little power to the knowledge base, we can add an inference engine, such as Prolog, and a set of rules.
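
Here is a tiny example of recording such facts, using rdflib in Python; the library and the namespace are stand-ins for whatever triple store the knowledge base ends up using.

```python
# A small sketch of facts as RDF triples, using rdflib as a stand-in store.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/kb/")  # placeholder namespace

g = Graph()
g.add((EX.letter42, EX.author, EX.jane_doe))                 # subject, predicate, object
g.add((EX.letter42, EX.dateWritten, Literal("1871-03-04")))
g.add((EX.jane_doe, EX.bornIn, EX.boston))

# Every fact about a resource is just another triple with that resource as subject.
for subject, predicate, obj in g.triples((EX.letter42, None, None)):
    print(subject, predicate, obj)
```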

An inference engine lets us ask questions such as “who are the descendants of A?” or “which papers lead to P through a citation chain?” or even “are A and B related?” without having to write a lot of programming code.
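
As a sketch of what that buys us, here is the “descendants of A” question written as a SPARQL 1.1 property-path query over rdflib. SPARQL is standing in for whatever rule language the inference engine actually uses, and the names are made up.

```python
# A hedged sketch: "who are the descendants of alice?" as a property-path query
# rather than hand-written traversal code. Names and namespace are illustrative.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/kb/")

g = Graph()
g.add((EX.bob, EX.childOf, EX.alice))
g.add((EX.carol, EX.childOf, EX.bob))

# childOf+ follows one or more childOf links, so both bob and carol are found.
results = g.query(
    "SELECT ?person WHERE { ?person ex:childOf+ ex:alice . }",
    initNs={"ex": EX},
)
for row in results:
    print(row.person)
```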

The resources in the knowledge base will be exposed through a REST interface that allows editing (with the proper permissions) as well as reading. This gives the greatest flexibility for clients, from web browsers to non-end-user systems such as other research projects.
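
A client interaction might look something like the following; the URL scheme, payload shape, and authentication are assumptions for illustration, not a promised interface.

```python
# Hypothetical client usage of the REST interface; every URL and field name
# below is a placeholder.
import requests

BASE = "https://example.edu/workbench"  # imaginary deployment

# Anyone can read a public resource.
resource = requests.get(
    f"{BASE}/resources/letter42",
    headers={"Accept": "application/json"},
).json()

# Editing requires the proper permissions.
resource["dateWritten"] = "1871-03-05"
requests.put(
    f"{BASE}/resources/letter42",
    json=resource,
    auth=("editor", "secret"),  # placeholder credentials
)
```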

To make knowledge bases and inference engines available without leaving them open to abuse (inferencing can require a lot of computation), I’m planning on offering the results of running a particular query as an RSS or Atom feed. This is similar to the interface for listing all of the resources of a particular type in a model, but would be read-only. The actual query would be defined through some administrative interface that associates a public URL with the query, similar to how Yahoo! Pipes exposes its pipes. This has the added advantage of decoupling the query’s public identifier from its definition, making client applications easier to maintain.
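
One possible shape for that query-to-feed mapping, sketched with hypothetical names:

```python
# Illustrative only: the administrative interface would maintain a mapping from
# a stable public slug to the (possibly expensive) stored query behind it.
from datetime import datetime, timezone

FEEDS = {
    "descendants-of-alice": "SELECT ?person WHERE { ?person ex:childOf+ ex:alice . }",
}


def atom_entries(slug: str, run_query) -> list[dict]:
    """Run the stored query behind a public feed slug and wrap each result as a
    minimal Atom-style entry. Clients only ever see the slug, never the query."""
    now = datetime.now(timezone.utc).isoformat()
    return [
        {"id": str(row), "title": str(row), "updated": now}
        for row in run_query(FEEDS[slug])
    ]
```

Because clients only hold the slug, the query behind it can be changed or tuned without breaking them.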

If licensing works out (which we currently expect it to), I should have a code release in a month. It won’t have ACLs, inferencing, or other “fancy” features, but it should be a proof of concept for the basic ideas.
