There's Something Odd About Emma

[Image: Part of the LHC at CERN, an experimental endeavor. Image via Wikipedia.]

During the first week of March, a group of humanities scholars and developers gathered in the bowels of McKeldin Library on the University of Maryland campus. The MITH website has a page describing the results. We held Corpora Camp to test some ideas on distributed humanities computing that will eventually feed into the next phase of Project Bamboo. The result is an application for exploring texts called “woodchipper.”

While most of my focus is on the architecture, I want to explore what we’re beginning to enable.  If you have experience with Scrivener and Mathematica, then you have some idea of where I’m headed.  I won’t flesh everything out here.  That will have to wait for my Digital Dialogues talk at the end of April titled, “Player Piano: Mechanizing the Humanities.”

I tend to build new systems from the bottom up. This lets things emerge that I might not have thought about. It also means that I can put off until tomorrow what doesn’t need to be decided today. I just make sure I don’t do something today that might have to be undone tomorrow, or is so fundamentally wrong that I have to start over tomorrow. A sign that a system is designed well is that someone else can use it to do something that the creator didn’t think about.

Digital Humanities is all about doing research: discovering new knowledge. It doesn’t make sense to tell the researcher that they can only ask questions that the programmer has already thought about. That’s not how physics instruments are designed in astronomy or in particle physics. Telescopes and colliders are designed to allow collection of as much data as possible. The analysis and interpretation are separate, and might involve questions that weren’t thought of when the instrument was built. In fact, the computing infrastructure for the Large Hadron Collider is designed to focus on the questions we haven’t thought to ask yet because it weeds out data based on the questions we’ve already asked. That is, the LHC compute grid filters out uninteresting events and passes along interesting events: events that surprise physicists.

This isn’t all that different from what humanists are looking for. If you look at the output of the woodchipper, you’ll see a lot of expected results, but you’ll also see the occasional surprise. It’s the surprise that is interesting and warrants further research. This is why the woodchipper is designed to let you drill down and see what part of the text is producing that surprise.

As part of building the woodchipper, we built a console application that lets us interact with the various pieces before they are assembled into the final interactive application. This console lets us ask questions that aren’t part of the woodchipper design: we formulate our own questions in an expression language that ties all of the pieces together. The expressions you’ll see in the rest of this post are written in that expression language.

One of the interesting things I’ve found exploring the collections is that the number of documents mentioning Emma seems to depend on the parity of the publication year, while the numbers for shipper, skipper, and Edwin do not.

Word      Even Years   Odd Years   Difference
Emma      101          121         20
Edwin     130          128         2
shipper   7            6           1
skipper   29           30          1

The command for each of these is pretty simple:

The first is for even years (the date divided by two leaves a remainder of zero). The second is for odd years (a remainder of one). I replaced “edwin” with each of the words in the table.

What the code does is first search for documents that contain the word “edwin” in the text. It then selects those documents whose publication date is either even or odd, depending on whether we compare the remainder to zero or one. Finally, we count how many documents make it through the filter.
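The three steps just described (search, filter on the remainder, count) can be sketched in Python. This is not the woodchipper expression language, which isn’t reproduced here; the `Document` class, the `count_mentions` function, and the four sample documents are all made up for illustration, and the word match is a naive whitespace split.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for a document in the collection."""
    title: str
    year: int
    text: str


def count_mentions(docs, word, divisor, remainder):
    """Count documents mentioning `word` whose year % divisor == remainder."""
    word = word.lower()
    # Step 1: search for documents whose text contains the word.
    # (Naive tokenization: punctuation attached to a word would hide it.)
    hits = [d for d in docs if word in d.text.lower().split()]
    # Step 2: keep only documents whose year leaves the requested remainder.
    selected = [d for d in hits if d.year % divisor == remainder]
    # Step 3: count what made it through the filter.
    return len(selected)


# Made-up sample data, not taken from the real collections.
docs = [
    Document("A", 1798, "Edwin went to sea"),
    Document("B", 1799, "the skipper and Edwin"),
    Document("C", 1799, "Emma stayed home"),
    Document("D", 1800, "no names here"),
]

even_edwin = count_mentions(docs, "edwin", 2, 0)  # even years
odd_edwin = count_mentions(docs, "edwin", 2, 1)   # odd years
print(even_edwin, odd_edwin)
```

Passing the divisor as a parameter means the same function covers both the even/odd case and division by any other number.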

Let’s try another experiment. This time, I want to see how many documents mention “emma” and have a particular remainder after the year is divided by a particular number. Let’s say that the number is $n and that the remainder is $r. Then, I’m running the following command for each cell in the table and reporting the number.

$n   mean          $r=0   $r=1   $r=2   $r=3   $r=4
2    111    10     101    121
3    74     10     67     66     89
4    55.5   11     36     57     65     64
5    44.4   5      45     42     44     37     54

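As a sanity check on the table, the summary columns can be recomputed from the per-remainder counts. The sketch below uses only the counts printed above. It assumes that the column between the mean and the $r=0 counts, which has no label, is the population standard deviation rounded down; that is a guess that happens to match all four rows, not something the table states.

```python
import math
from statistics import mean, pstdev

# Counts copied from the table: for each divisor n, the number of documents
# mentioning "emma" whose year leaves remainder 0, 1, ..., n-1.
counts_by_divisor = {
    2: [101, 121],
    3: [67, 66, 89],
    4: [36, 57, 65, 64],
    5: [45, 42, 44, 37, 54],
}


def summarize(counts):
    """Return (total, mean, population standard deviation) for one row."""
    return sum(counts), mean(counts), pstdev(counts)


for n, counts in counts_by_divisor.items():
    total, avg, spread = summarize(counts)
    # Assumption: the unlabeled column is the spread rounded down.
    print(n, total, round(avg, 1), math.floor(spread))
```

Every row adds up to the same 222 documents, as it should, since each row just splits the same set of “emma” documents a different way.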
None of these numbers really mean anything except that we need to dive into our text collections and figure out what’s causing the unexpected results. It could be bad OCR, a bias in the text selection, or something else that has nothing to do with what happened historically. Only after we rule out this “instrumentation error” can we start looking for possible explanations for why publishers were biased against works about Emma in even years.

If we wanted to see how many documents were in each year, we could replace the f:count with an f:histogram. For example, with “skipper”, we can run

and get the table:

Year   Count      Year   Count
1609   1          1766   1
1669   1          1767   1
1676   1          1768   2
1681   1          1769   1
1694   1          1770   3
1701   1          1775   1
1717   1          1776   1
1722   3          1777   1
1726   6          1778   1
1727   2          1780   1
1743   1          1782   1
1746   1          1790   1
1747   3          1793   1
1749   1          1794   1
1753   2          1796   1
1755   1          1797   2
1757   1          1798   3
1763   1          1799   5

This is the beginning of something like Google’s n-gram viewer. Of course, it would be easier to take in the numbers if I plotted them, but I don’t have that bit of code written yet.
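For readers who want to try the idea outside the console, here is a minimal Python sketch of a per-year histogram with a crude text plot. It is not f:histogram itself; `year_histogram` and `text_plot` are hypothetical helpers, and the input is a list of publication years built from three rows of the skipper table (1722, 1726, 1727).

```python
from collections import Counter


def year_histogram(years):
    """Return (year, count) pairs sorted by year."""
    return sorted(Counter(years).items())


def text_plot(histogram, mark="#"):
    """Render each (year, count) pair as a line with one mark per document."""
    return [f"{year} {mark * count}" for year, count in histogram]


# One entry per matching document; the order does not matter.
years = [1726] * 6 + [1722] * 3 + [1727] * 2

histogram = year_histogram(years)
for line in text_plot(histogram):
    print(line)
```

A row of marks per year is far from a real plot, but it is enough to spot a spike like the one in 1726.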

In the end, a system like this does nothing more or less than what we see the big-dollar physics experiments doing: provoking us to ask questions and look for answers.

Eventually, we’ll release all of this code. Right now, we’re doing our last bit of integration and development to get the alpha version of the woodchipper up and running.
