Today, let's build a set of tools that will help us create a concordance of a text. We'll have to make a lot of assumptions so that we can see the core pieces, so keep in mind that any real implementation will probably have different details.
We'll assume for now that we have a stream of characters representing the text. We haven't discussed where we get data or where we store it yet. That's for another time. For now, we're focused on what we do with the data between getting and storing it. If we can wrap our minds around what to do with the data, then we can plug in any data retrieval or storage we want onto our processing later.
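To make that concrete, here is a minimal sketch in Python of the middle piece: characters come in, a concordance comes out. The names `tokens` and `concordance` are made up for illustration, and the rules for what counts as a word (letters, digits, and apostrophes, folded to lower case) are assumptions of mine. A real implementation will probably differ in those details.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def tokens(chars: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (position, word) pairs from a stream of characters.

    Assumption: a word is a run of letters, digits, or apostrophes,
    and positions count words (not characters) starting at zero.
    """
    word: List[str] = []
    position = 0
    for ch in chars:
        if ch.isalnum() or ch == "'":
            word.append(ch.lower())
        elif word:
            # A non-word character ends the current word.
            yield position, "".join(word)
            position += 1
            word = []
    if word:
        # Flush the last word if the stream ends without a separator.
        yield position, "".join(word)


def concordance(chars: Iterable[str]) -> Dict[str, List[int]]:
    """Map each word to the list of word positions where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, word in tokens(chars):
        index[word].append(position)
    return dict(index)


if __name__ == "__main__":
    print(concordance("the cat saw the dog"))
```

Because `tokens` only asks for something it can iterate over one character at a time, the source could be a string, a file read in chunks, or a network stream. That is the sense in which retrieval and storage can be plugged in later.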
While I am trying to round out the content management aspects of OokOok this year, I'm starting to think ahead to next year's work on databases and processing. Part of the goal is to offer a platform that lets you take advantage of parallel processing without requiring that you be aware that you're doing so. Of course, any such platform will be less powerful than hand-coding parallel code in C or using your favorite Hadoop library. Less powerful is better than not available. I want OokOok to make available capabilities that would otherwise be hidden away.
Map/reduce seems like the simplest way to think about parallel processing. We have two kinds of operations: those that look at one item at a time (mappings), and those that have to see everything before they can finish their calculation (reductions). Reductions can get by seeing one item at a time if they can keep notes on a scratch pad. We could then put operations into two slightly different camps: those that need a scratch pad (reductions) and those that don't (mappings).
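A small Python sketch of the two camps, using a word count as the example. This is not OokOok code; the function names `normalize`, `tally`, and `word_counts` are invented for illustration, and the punctuation stripped by `normalize` is an arbitrary choice. The mapping looks at one word and needs nothing else. The reduction also sees one word at a time, but it carries a dictionary along as its scratch pad.

```python
from functools import reduce
from typing import Dict, Iterable


def normalize(word: str) -> str:
    """A mapping: looks at one item and needs no memory of the others."""
    return word.strip(".,;:!?\"'").lower()


def tally(scratch_pad: Dict[str, int], word: str) -> Dict[str, int]:
    """A reduction step: sees one item at a time, keeps notes on a scratch pad."""
    scratch_pad[word] = scratch_pad.get(word, 0) + 1
    return scratch_pad


def word_counts(words: Iterable[str]) -> Dict[str, int]:
    # map() applies the mapping item by item; reduce() threads the
    # scratch pad (starting as an empty dict) through every item.
    return reduce(tally, map(normalize, words), {})


if __name__ == "__main__":
    print(word_counts(["The", "cat", "the", "Cat.", "dog"]))
```

Since `normalize` never looks beyond its own item, its calls could be spread across workers in any order. The reduction is the part that needs care, because the scratch pads from different workers would have to be merged at the end.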
I'm making rapid progress in getting OokOok to a stable programming scheme. I haven't made a lot of changes in its capabilities, though I did add the ability to archive themes and projects as BagIt files yesterday. Instead, I've been working on making the important stuff declarative. By hiding all the details behind a veneer of relationships, I can fiddle with how I manage those relationships without having to touch every relationship every time I make a change in the underlying schemas (and schemes).
For those used to an older style of Perl programming, this might come as a surprise. For those who have dealt with things like MooseX::Declare and CatalystX::Declare, you'll be shaking your head at my foolhardiness in jumping into making an OokOok::Declare that hides the details of how to construct certain types of classes.
Behind the scenes, OokOok consists of controllers, models, views, REST collections/resources, SQL result objects, a template engine, and tag libraries for the templates. Almost two hundred classes in all.
If I built all of these the usual Perl way, there'd be a lot of boilerplate code around. By moving to a declarative approach, I can isolate all the boilerplate in a few core meta-classes. When the boilerplate has to change, I only have to touch one place. Everything else comes along for the ride.
For the rest of this post, I want to walk through how I use some of these declarative constructions. I won't get into the scary details of how to make declarative constructions in Perl (at least, not in this post).
In my last post, I talked some about the need to look across projects and find common elements that could be factored out. I'd like to start a series of posts in which I talk about some of the work I'm doing at MITH in developing some foundational libraries that we are using to build digital humanities projects. Along the way, I'll discuss some of the philosophy behind those libraries and our approach to the projects.
Miriam Posner's post, "Some things to think about before you exhort everyone to code," has touched off a series of conversations on Twitter and elsewhere. My own feeling is that she's nailing some things square on the head and, fortunately, doesn't conclude by saying that we should banish coding from the digital humanities. We just need to be careful how we cast the need for coding.
I've tossed around a nugget in my mind for the last few weeks, and Miriam's post is making me focus more intently on it: A digital humanist afraid of the digital is like a scholar of French literature who is afraid of French. You can't be a digital humanist if you don't understand the digital. That doesn't mean you have to be able to code any more than being a scholar of French literature means you have to be able to write French literature. You just have to be able to understand the nuances of what you're studying and how you are studying it. Otherwise, how can you properly interpret the results?
I'm working through some ideas on how to move the Utukku/Fabulator expression language more into a descriptive, functional style. I want to be able to have the programming be exposed as an editorial statement showing how certain calculations are done or inferences are drawn. The computer's interpretation of the data can be as important as a person's, and knowing what the person was expecting the computer to do can be as important as knowing what the person thought they wanted the computer to do.
With that in mind, I want to walk through a few possible ways of constructing phrases and inference rules to see how they go. Since my stereotypical example seems to be a concordance, that's where I hope to end up.