Concording bits and pieces

At a basic level, a concordance is an annotated word frequency chart. Presentation and corpus might differ across projects, but any project with a concordance manages tables of words tied to frequencies and other information.

My current work is to set up an environment where we can say something like the following.

f:sum( for $line in $doc/* return c:concordance($line) with ./*/line = $line )

That might look a bit daunting, but let’s unpack it.

The outer function, f:sum, says that we want to accumulate concordances to build an overall concordance. The sum of two concordances is a single concordance with combined counts and accumulated annotations (e.g., combined references to lines).

The ‘for $line in $doc/*’ is a way of going through every line in a document (in this case, a transcription in the Donne project). The exact path will depend on how concorded documents are managed and how fine-grained you want your location tracking to be. The ‘return …’ part just says that we want to build a list of things based on what we’re looping over (in this case, the ‘$line’).

The ‘c:concordance($line)’ builds a concordance object with word frequencies for the given ‘$line’. The ‘with ./*/line = $line’ part runs the expression against that concordance. Because the children of a concordance are its words, this adds a reference to the line to every word in the concordance. The reference gets carried through the f:sum, resulting in one large concordance with word frequencies and annotations of every line containing a particular word.
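
To make the shape of that computation concrete, here is a rough Python analogue. It is only a sketch of the idea, not Fabulator code: the names Entry, concordance and sum_concordances are made up for illustration, the tokenizer is a simple regular expression, and the two sample lines stand in for a transcription.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One word in a concordance: a frequency plus the lines it appears on."""
    count: int = 0
    lines: set = field(default_factory=set)


def concordance(text, line_ref):
    """Build a concordance for one line, tagging every word with line_ref."""
    conc = {}
    for word in re.findall(r"[a-z']+", text.lower()):
        entry = conc.setdefault(word, Entry())
        entry.count += 1
        entry.lines.add(line_ref)
    return conc


def sum_concordances(concs):
    """Combine concordances: counts add up, line references accumulate."""
    total = {}
    for conc in concs:
        for word, entry in conc.items():
            merged = total.setdefault(word, Entry())
            merged.count += entry.count
            merged.lines |= entry.lines
    return total


# Sample lines; the line numbers play the role of the line references.
doc = ["Go and catch a falling star", "Get with child a mandrake root"]
overall = sum_concordances(
    concordance(line, n) for n, line in enumerate(doc, start=1)
)
print(overall["a"])
```

The generator expression plays the part of the ‘for … return …’ loop, and sum_concordances plays the part of f:sum.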

Extension functions can be written as expressions

I’m working on some functions that can be useful for a concordance.  Right now, that’s a function to give me the frequency of each word in a text.

I’ve successfully defined it as follows:

This allows functions to be written in terms of previously defined functions, using all of the expressiveness of the Fabulator expression language.

For example, if I defined the XML prefix ‘c’ to correspond to the namespace in which the ‘count-words’ function is defined, then I could run the following expression:

And then ‘$i/of’ would equal 3 because the word ‘of’ appears three times.
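
For readers who don’t have the Fabulator at hand, the same idea can be sketched in Python. This is an analogue and not the Fabulator definition: the count_words name is borrowed from the function mentioned above, the tokenizer is an assumption, and the sample sentence is made up so that ‘of’ appears three times.

```python
import re
from collections import Counter


def count_words(text):
    """Return a mapping from each lowercased word to its frequency."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Made-up sample text in which 'of' occurs three times.
i = count_words("the best of times, the worst of times, the age of wisdom")
print(i["of"])
```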

Fabulator XML functions on GitHub

I created a GitHub repository for the XML extensions to Fabulator. This is the first step towards a set of general-purpose functions for managing TEI documents. With this and recent changes to the core Fabulator Radiant extension, we can browse TEI documents in the CMS and extract information from within the document.

These two functions move us much closer to being able to extract geographic information from documents for use in the Digital Concord project.

Fabulator sees Radiant

The Fabulator is a way to build interactive applications within Radiant. I’m using it as the framework for building several DH projects this semester that work with transcriptions of textual artifacts. I just pushed a set of changes that allow Fabulator applications to traverse the page hierarchy and access page content (for example, ‘radiant::/@title’ gives the title of the home page). This means that we can manage TEI documents as pages in the CMS and still process them to extract information for a database, all within the CMS.

Scholarly Software Editions

The NEH and other U.S. federal government agencies are pushing digital humanities projects to result in something that can be shared. If the result is an application that people can use, especially one that resides on a central server, then the NEH also wants provisions for long-term maintenance. Ultimately, digital humanities projects should seek to be a resource that other scholarly work can build on. In this post, I want to explore what this might mean for web-based applications.

XML Transformation Creation

I was looking around the web for references about EAD, an XML vocabulary mentioned in a Digital Humanities Working Group meeting on Monday. I could see cases where people would want to have documents marked up with both TEI and EAD.

An XSLT stylesheet basically describes a function that is applied to an XML document, resulting in another document (not necessarily XML): D = f(X) where X is a subset of D (for a particular document, I’d say: d = f(x)). We are usually given X and f and asked for D, but I’m wondering if we could be given D and X and find f.

This is definitely a pure computer science problem, but it has digital humanities applications. A web search shows some work in this direction, but it usually has people manually map elements between the two document sets to generate the XSLT.
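
As a toy illustration of going from paired documents back to a mapping, the Python sketch below handles only the simplest possible case: the two trees have the same shape and differ only in element names. The infer_renames function and the target vocabulary (section, title, poem, line) are made up for the example; real vocabularies would need far more than renaming.

```python
import xml.etree.ElementTree as ET


def infer_renames(source, target, mapping=None):
    """Walk two same-shaped trees in parallel, recording source tag -> target tag."""
    if mapping is None:
        mapping = {}
    known = mapping.setdefault(source.tag, target.tag)
    if known != target.tag:
        raise ValueError(
            f"{source.tag!r} maps to both {known!r} and {target.tag!r}"
        )
    if len(source) != len(target):
        raise ValueError("trees differ in shape; renaming alone cannot explain them")
    for s_child, t_child in zip(source, target):
        infer_renames(s_child, t_child, mapping)
    return mapping


# x is a TEI-like fragment; d is the same content in a hypothetical vocabulary.
x = ET.fromstring(
    "<div><head>Song</head><lg><l>Go and catch a falling star</l></lg></div>"
)
d = ET.fromstring(
    "<section><title>Song</title>"
    "<poem><line>Go and catch a falling star</line></poem></section>"
)
f = infer_renames(x, d)
print(f)
```

The resulting dictionary is the kind of element map that people currently write by hand before generating the XSLT.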

Another thing that would come from this is a way to rank XML vocabularies by their expressive range. Suppose we have two sets of documents (A and B) based on two different XML vocabularies. If an XSLT exists that maps A -> B, but no XSLT exists that maps B -> A, then the vocabulary for A could be seen as having a larger expressive range than the one used for B. That would give us a more solid foundation for saying that TEI is more expressive than DocBook (which I believe it is, but I don’t have good data to base that belief on at the moment).

I can manually create XSLTs to go from TEI to DocBook to HTML because I believe there’s a loss of information from one format to the next (ignoring the pushing of that information into CSS at the final HTML stage) and because DocBook is a publishing vocabulary and HTML is, with CSS, a de facto typesetting vocabulary. The information isn’t so much lost as transformed from semantic to presentation, with the person reading the resulting document adding back the semantic information based on the presentation. The semantic information, though, is removed from a readily computer-understood form: it’s gone from a context-free to a context-dependent form.
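
One way to see that loss is to look at an element mapping and ask whether it can be run backwards. The Python sketch below uses a made-up mapping from a few TEI element names to HTML tags (it is not taken from any real stylesheet) and reports the target tags that more than one source element collapses into; the ambiguous_targets name is also just for illustration.

```python
from collections import defaultdict


def ambiguous_targets(mapping):
    """Return target tags that several source tags map to (so no inverse exists)."""
    sources_by_target = defaultdict(list)
    for source, target in mapping.items():
        sources_by_target[target].append(source)
    return {t: s for t, s in sources_by_target.items() if len(s) > 1}


# Illustrative only: three semantic TEI elements all rendered as italics.
tei_to_html = {"persName": "i", "placeName": "i", "title": "i", "p": "p"}
lost = ambiguous_targets(tei_to_html)
print(lost)
```

Anything that shows up in the result is a place where a program reading the HTML can no longer tell which semantic element it started from, even though a human reader often can from context.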