Annotating Data

I just checked in some changes that show ‘with’ working, at least to some degree. More use cases will help flesh it out.

The ‘with’ keyword in expressions adds information to nodes without changing which nodes an expression returns. This is useful for annotating data while passing it on to another function for further processing. In the context of concordances, it means we can annotate the words and then pass the list of words on to a function that combines the lists into a larger list. This lets us break the concordance process into smaller steps that each work on a particular line or page of a manuscript. We can attach to each word the line or page on which it was found, and then retain that information as we combine concordances of pages into concordances of manuscripts.
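As a quick sketch of the idea (the paths here are illustrative, not taken from a real project), annotating every word in a page’s concordance with the page it came from could look like:

c:concordance($page) with ./*/page = $page

The expression still returns the concordance node; the ‘with’ clause only attaches a ‘page’ annotation to each child word.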

This lets us do the following:
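In sketch form (glossing over exactly how the $manuscript URL resolves to a document, and assuming line-by-line tracking as in the expression unpacked later in this post):

f:sum( for $line in $manuscript/* return c:concordance($line) with ./*/line = $line )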

That should (hopefully) build the concordance data for a particular manuscript (given as the URL in $manuscript). Note that the c:concordance function only compiles a list of word frequencies for the given text. There’s not a lot of other magic going on.

Accessing concordance data

I’m making good progress on the concordance front.  I can now do the following:

c:concordance("some text")/some/count

This will convert the string to a concordance object (implicitly compiling the concordance) and give the frequency/count of the word “some”.

My goal now is to get the “with ./*/foo := bar” fragment working with an internal Ruby representation of a concordance.  This will allow me to annotate the words in a concordance.  The internal object already preserves annotations when combining concordances.

Once I have a good serialization format for the concordance, I will be able to persist the concordance in some form — perhaps RDF, but not necessarily so.  That combined with annotations of where a word appears will let me do searches of words:

c:concordance($doc)/foo/line

to give me the lines on which a word appears.

c:concordance($doc)/*[f:matches(node-name(.), "^f")]/line

to give me the lines on which any word beginning with the letter “f” appears.

At that point, the challenge will be to optimize these idioms so they don’t take forever to run.

Concording bits and pieces

At a basic level, a concordance is an annotated word frequency chart.  Presentation and corpora might differ across projects, but any project with a concordance manages tables of words tied to frequencies and other information.

My current work is to set up an environment where we can say something like the following:

f:sum( for $line in $doc/* return c:concordance($line) with ./*/line = $line )

That might look a bit daunting, but let’s unpack it.

The outer function is stating that we want to accumulate concordances to build an overall concordance.  The sum of two concordances is a single concordance with combined counts and accumulated annotations (e.g., combined references to lines).
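For example (assuming a literal sequence can be looped over like this), summing the concordances of the texts “the cat” and “the hat” should yield a single concordance in which the/count is 2 while cat/count and hat/count are each 1:

f:sum( for $text in ("the cat", "the hat") return c:concordance($text) )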

The ‘for $line in $doc/*’ is a way of going through every line in a document (in this case, a transcription in the Donne project).  How you loop will depend on how concorded documents are managed and how fine-grained you want your location tracking to be.  The ‘return …’ part just says that we want to build a list of things based on what we’re looping over (in this case, the ‘$line’).

The ‘c:concordance($line)’ builds a concordance object with word frequencies for the given ‘$line’.  We add a reference to the line with the ‘with ./*/line = $line’ part, which runs the expression against the concordance, adding a reference to the line to every word in the concordance.  The children of a concordance are the words in the concordance, so we end up adding the reference to each word.  The reference gets carried through the f:sum, resulting in one large concordance with word frequencies and annotations of every line containing a particular word.
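Putting this together with the earlier access idiom, an expression like the following (using the same illustrative paths) would give every line on which the word ‘foo’ occurs anywhere in the document:

f:sum( for $line in $doc/* return c:concordance($line) with ./*/line = $line )/foo/line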

Extension functions can be written as expressions

I’m working on some functions that can be useful for a concordance.  Right now, that’s a function to give me the frequency of each word in a text.

I’ve successfully defined it as follows:

This allows functions to be written in terms of previously defined functions, using all of the expressiveness of the Fabulator expression language.

For example, if I defined the XML prefix ‘c’ to correspond to the namespace in which the ‘count-words’ function is defined, then I could run the following expression:
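Something along these lines, where the ‘:=’ assignment form and the sample text are only a guess:

$i := c:count-words("out of sight, out of mind, out of time")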

And then ‘$i/of’ would equal 3 because the word ‘of’ appears three times.

Fabulator XML functions on github

I created a GitHub repository for the XML extensions to Fabulator.  This is the first step towards a set of general purpose functions for managing TEI documents.  With this and recent changes to the core Fabulator Radiant extension, we can browse TEI documents in the CMS and extract information from within the document.

These two functions move us much closer to being able to extract geographic information from documents for use in the Digital Concord project.

Fabulator sees Radiant

The Fabulator is a way to build interactive applications within Radiant.  I’m using it as the framework for building several DH projects this semester that work with transcriptions of textual artifacts.  I just pushed a set of changes that allow Fabulator applications to traverse the page hierarchy and access page content (for example, ‘radiant::/@title’ gives the title of the home page).  This means that we can manage TEI documents as pages in the CMS and still process them to extract information for a database, all within the CMS.

Scholarly Software Editions

The NEH and other U.S. federal government agencies are pushing digital humanities projects to produce something that can be shared. If that something is an application people can use, especially one that resides on a central server, then the NEH also wants provisions for long-term maintenance. Ultimately, digital humanities projects should aim to be resources that other scholarly work can build on. In this post, I want to explore what this might mean for web-based applications.
