Algorithmic Provenance of Data

It doesn't seem like it's been over four years since I joined MITH and started working with Project Bamboo. Just because I've moved on to a startup and the project's been mothballed doesn't mean we can't mine what was done.

The problems with Project Bamboo are numerous and documented in several places. One of the fundamental mistakes made early on was taking a waterfall approach to designing and developing an enterprise-style workspace that would encompass all humanities research activities, rather than producing an agile environment that leveraged existing standards and tools. Top down rather than bottom up.

However, the idea that digital humanities projects share some common issues and could take advantage of shared solutions is important. This is part of the reporting aspect of research: when we learn something new, we report not only the new knowledge but also how we got there, so that someone else can do similar work with different data. If we discover a way to distinguish between two authors in a text, we not only publish what we think each author wrote, but the method by which we made that determination. Someone else can apply that same method to a different text.

Tracking Process

One of the ideas bounced around in Project Bamboo was recording what a tool did to produce a dataset. This kind of recording is standard practice in the sciences. When conducting an experiment in a lab, you record every step you take in excruciating detail. Move 1 ml of a solution from a beaker to a test tube? Write it down. Shake the test tube? Write it down. You never know when there will be a correlation between something you did and the results you get.

This recording is doubly useful when publishing a dataset that has been transformed. For example, if someone were to take the Folger Shakespeare TEI files and strip out certain elements to make them easier to process in some situations, then the derived TEI documents should be able to record not only the original documents from which they were derived, but the transformation as well.

This transformation recording is the digital equivalent of an editorial statement:

To produce the text in this edition, we transformed the long s into the terminal s, transformed ligatures into their component letters, and normalized the quotation marks.

Algorithmic Provenance

Rather than placing a block of text in the TEI header, we can use linked data principles to add a snippet of RDF/XML that describes the algorithms applied to the original dataset.

This recording of the transformations becomes the "algorithmic provenance" of the dataset. Thus my original title for this post.

What could this look like?

Let's assume we are publishing a stripped-down version of the Folger Library's TEI edition of Titus Andronicus.

TEI allows us to provide an editorial statement in the document header (the <editorialDecl/> element). The Folger Library uses this section to describe how quotation marks and hyphenation are treated, how the work is segmented, and particular interpretation decisions, among other editorial information that helps the scholar understand why something in this edition of the play might differ from another edition.
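For illustration only (the element names come from the TEI Guidelines, but the prose is invented rather than taken from the Folger files), an <editorialDecl/> might look something like this:

    <editorialDecl>
      <quotation marks="all">
        <p>Quotation marks in the source have been normalized to typographic quotes.</p>
      </quotation>
      <hyphenation eol="some">
        <p>End-of-line hyphens have been removed where the word is not normally hyphenated.</p>
      </hyphenation>
      <segmentation>
        <p>The text has been segmented into acts, scenes, and speeches.</p>
      </segmentation>
      <interpretation>
        <p>Ambiguous speech prefixes have been assigned to speakers according to the editors' judgment.</p>
      </interpretation>
    </editorialDecl>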

This is a natural place for us to describe the algorithms we used to remove unwanted tags from the Folger edition.

Since a TEI document is an XML document, we can apply XSLT. Not every dataset is going to be XML, so we can't assume that all transformations can be done via XSLT, but here we are fortunate: we can fall back on XSLT to define the particulars of what we're doing.
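As a sketch of what such a stylesheet might look like, here is an XSLT 1.0 identity transform that copies everything except a handful of elements; the particular elements stripped here are hypothetical choices made for the example, not the actual Folger processing:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0">

      <!-- Copy everything by default (the identity transform). -->
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>

      <!-- Drop page, column, and forme-work markup we don't need for this analysis. -->
      <xsl:template match="tei:fw | tei:pb | tei:cb"/>

      <!-- Keep only the regularized reading where the source offers both. -->
      <xsl:template match="tei:choice[tei:orig and tei:reg]">
        <xsl:apply-templates select="tei:reg/node()"/>
      </xsl:template>

    </xsl:stylesheet>

The first template copies nodes unchanged; the other templates override it for the elements we want to drop or simplify.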

We're transforming the <body/> of the TEI document. If we were to transform the entire document, including the header, then we'd be in a bind, since the transformation would have to encode the algorithmic provenance, including a record of itself, in the resulting transformed document. This is a variant on the theme of quines.

In our transformed document, we can add something like the following XML snippet to the <editorialDecl/>:
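A minimal sketch of such a snippet, assuming the W3C PROV-O vocabulary and hypothetical URIs for the source document, the stylesheet, and the transform concept (the exact shape is one possibility, not a prescribed form):

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:prov="http://www.w3.org/ns/prov#">

      <!-- "This document": the stripped-down derived edition. -->
      <prov:Entity rdf:about="">

        <!-- The Folger document we started from (hypothetical URI). -->
        <prov:wasDerivedFrom>
          <prov:Entity rdf:about="http://example.com/folger/Tit.xml"/>
        </prov:wasDerivedFrom>

        <!-- The transformation run that produced this document: the XSLT
             transform (treated as a software agent) applied a stylesheet
             to the source. Both URIs are hypothetical. -->
        <prov:wasGeneratedBy>
          <prov:Activity>
            <prov:wasAssociatedWith rdf:resource="http://example.com/transforms/xslt#transform"/>
            <prov:used rdf:resource="http://example.com/transforms/strip-tei-tags.xsl"/>
            <prov:used rdf:resource="http://example.com/folger/Tit.xml"/>
          </prov:Activity>
        </prov:wasGeneratedBy>
      </prov:Entity>
    </rdf:RDF>

The prov:wasDerivedFrom statement names the source, and the prov:wasGeneratedBy activity records which transform and which stylesheet produced this document.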

Making it Real

This isn't just a bit of navel gazing. I'm working on an engine that would be able to do something with the above RDF. Using the syntax of Dallycot, the derivation recorded in that RDF can be written as an equivalent Dallycot expression.

Even without the existence of an engine like Dallycot, the linked data aspects of the derivation buy us long-term understanding of what was done. The meaning of http://example.com/transforms/xslt#transform can be well-defined in a way that lets us understand what we intended by its invocation. If we ever mean something else by the concept of an XSLT transform, we will give that concept a different URI and use that in describing the provenance.

Furthermore, the URI of the function could be the URL of a document that defines the function. So a representation of the XSLT transformation engine could be stored at http://example.com/transforms/xslt#transform. An engine like Dallycot could fetch and execute that representation if it didn't already have an internal definition that matched.
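A sketch of what such a definition document might contain, using standard RDFS terms and hypothetical URIs (this is not an established vocabulary):

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
      <rdf:Description rdf:about="http://example.com/transforms/xslt#transform">
        <rdfs:label>XSLT transform</rdfs:label>
        <rdfs:comment>
          Applies an XSLT stylesheet to an XML document and returns the
          transformed document. Given the same stylesheet and input, the
          output is expected to be reproducible.
        </rdfs:comment>
        <!-- A pointer to a reference implementation an engine could fetch. -->
        <rdfs:seeAlso rdf:resource="http://example.com/transforms/xslt-reference"/>
      </rdf:Description>
    </rdf:RDF>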

Somewhere From Here

We use a lot of different transformations and algorithms in digital humanities. We don't have to have coded definitions for all of them in the same language. We just need well-written definitions that allow us to understand their expected and observed behavior.

One of the beauties of linked data is that it is a grassroots effort. Linked data gives us the tools we need to talk about things, but doesn't dictate or restrict what we can talk about. We don't even have to use RDF/XML if that's not a natural format for the dataset we're working with.

Rather than wait for some central committee to bless a set of algorithms that everyone uses, each tool maker can publish a document describing a well-defined vocabulary that lets us talk about what their tool does.

The text collation community has a jump start on this with the Gothenburg model for the text collation process.

For a given collation, we can define a tokenizer, an aligner, and an analyzer. The collation process could be represented in Dallycot as a composition of functions such as collate, words, expand-contractions, remove-inflection, needleman-wunsch, and transposition-finder.

The resulting variant graph could have an RDF representation of this pipeline as part of its provenance metadata. Tools like Juxta and CollateX should be able to do this without too much effort today.
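As a sketch of that RDF representation, reusing the function names above as URIs and assuming a hypothetical collation vocabulary (the c: properties, the witness URIs) alongside the standard PROV terms:

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:prov="http://www.w3.org/ns/prov#"
             xmlns:c="http://example.com/collation#">

      <!-- The published variant graph (hypothetical URI). -->
      <prov:Entity rdf:about="http://example.com/editions/variant-graph">
        <prov:wasGeneratedBy>
          <!-- The collation run, recording which function played each role.
               The c: properties and witness URIs are hypothetical. -->
          <prov:Activity>
            <prov:wasAssociatedWith rdf:resource="http://example.com/collation#collate"/>
            <c:tokenizer rdf:resource="http://example.com/collation#words"/>
            <c:normalizer rdf:resource="http://example.com/collation#expand-contractions"/>
            <c:normalizer rdf:resource="http://example.com/collation#remove-inflection"/>
            <c:aligner rdf:resource="http://example.com/collation#needleman-wunsch"/>
            <c:analyzer rdf:resource="http://example.com/collation#transposition-finder"/>
            <prov:used rdf:resource="http://example.com/editions/witness-A"/>
            <prov:used rdf:resource="http://example.com/editions/witness-B"/>
          </prov:Activity>
        </prov:wasGeneratedBy>
      </prov:Entity>
    </rdf:RDF>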

Conclusion

We no longer have to talk about someday being able to record the algorithmic provenance of published datasets. Using linked data techniques, we can record what we do in a way that can be analyzed and reproduced independent of the language used to implement the transformations.

Our tools will be able to tell us how a particular dataset was created. For example, Juxta could examine a sequence alignment produced by CollateX and apply the same transformation to a different set of source documents. This is a way to let us say, "Do to my documents what someone did to theirs to produce this result."