The RDF equivalent of “If you can’t say anything nice, don’t say anything at all” is “If you can’t assert something, then don’t assert anything at all.”
I’m building a language designed to be natural with linked data just as programs today are natural with local memory. The result is highly functional and data-relative in nature and reminds me of how XSLT works relative to XML nodes (e.g., a current node, child nodes, ancestor nodes). I have quite a few tests passing for the parser and core engine, so now I’m branching out into libraries and seeing what I can do with all of the pieces.
A few months ago, I accepted a job outside the academy. This doesn’t mean that I’m abandoning digital humanities. In this post, I lay out what I want to do in DH going forward. The common thread through all this is that I believe linked open data is the best way to break down the silo walls that keep digital humanities projects from sharing and building on existing data.
With the recent postings elsewhere about Markov chains and text production, I figured I’d take a stab at it. I based my code on the Lovebible.pl code at the latter link. Instead of the King James Bible and Lovecraft, I combined two million words from sources found on Project Gutenberg:
- The Mystery of Edwin Drood, by Charles Dickens
- The Secret Agent, by Joseph Conrad
- Superstition in All Ages (1732), by Jean Meslier
- The Works of Edgar Allan Poe, by Edgar Allan Poe (five volumes)
- The New Pun Book, by Thomas A. Brown and Thomas Joseph Carey
- The Superstitions of Witchcraft, by Howard Williams
- The House-Boat on the Styx, by John Kendrick Bangs
- Bulfinch’s Mythology: the Age of Fable, by Thomas Bulfinch
- Dracula, by Bram Stoker
- The World English Bible (WEB)
- Frankenstein, by Mary Shelley
- The Vision of Paradise, by Dante
- The Devil’s Dictionary, by Ambrose Bierce
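The chain itself is simple to sketch. Lovebible.pl is Perl, but the idea carries over; here’s a minimal, purely illustrative Python version (function names are my own, not from the original script) that maps each n-word state to the words that follow it, then walks those states at random:

```python
import random
from collections import defaultdict

def build_chain(words, order=2):
    """Map each tuple of `order` consecutive words to the words that follow it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return chain

def generate(chain, length=50):
    """Walk the chain from a random starting state, emitting up to `length` words."""
    state = random.choice(list(chain.keys()))
    out = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:  # dead end: the state only appeared at the text's end
            break
        out.append(random.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)
```

Feed `build_chain` the two million words and the mashup falls out of `generate`; raising `order` makes the output more coherent but closer to verbatim quotation.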
In the last post (back in 2012—this post has been in draft for some time), we talked a bit about streams. Eventually, I want to see how we might use streams to write a parser that can parse the language we’re developing to talk about streams, but let’s start with something a little simpler. The textual digital humanities revolve around text, and the really fun stuff involves textual manipulation if not parsing.
Today, let’s build a set of tools that will help us create a concordance of a text. We’ll have to make a lot of assumptions so that we can see the core pieces, so keep in mind that any real implementation will probably have different details.
We’ll assume for now that we have a stream of characters representing the text. We haven’t discussed where we get data or where we store it yet. That’s for another time. For now, we’re focused on what we do with the data between getting and storing it. If we can wrap our minds around what to do with the data, then we can plug in any data retrieval or storage we want onto our processing later.
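To make the stream idea concrete, here is a minimal sketch in Python (not the stream language we’ve been developing, and with all the simplifying assumptions noted above): one generator folds the character stream into a stream of words, and a second pass folds the word stream into a word-to-positions index, which is the skeleton of a concordance:

```python
from collections import defaultdict

def words_from_chars(chars):
    """Fold a stream of characters into a stream of lowercase words.

    Assumes words are maximal runs of alphabetic characters; any real
    implementation would handle apostrophes, hyphens, etc.
    """
    word = []
    for ch in chars:
        if ch.isalpha():
            word.append(ch.lower())
        elif word:
            yield "".join(word)
            word = []
    if word:  # flush the final word if the stream ends mid-word
        yield "".join(word)

def concordance(chars):
    """Map each word to the 0-based word positions at which it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(words_from_chars(chars)):
        index[word].append(position)
    return index
```

Because each stage consumes one item at a time, neither needs the whole text in memory; swapping in a file, a network source, or a database cursor only changes where `chars` comes from.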
While I am trying to round out the content management aspects of OokOok this year, I’m starting to think ahead to next year’s work on databases and processing. Part of the goal is to offer a platform that lets you take advantage of parallel processing without requiring that you be aware that you’re doing so. Of course, any such platform will be less powerful than hand-coding parallel code in C or using your favorite Hadoop library. Less powerful is better than not available. I want OokOok to make available capabilities that would otherwise be hidden away.
Map/reduce seems like the simplest way to think about parallel processing. We have two kinds of operations: those that look at one item at a time (mappings) and those that have to see everything before they can finish their calculation (reductions). Reductions can get by seeing one item at a time if they can keep notes on a scratch pad. We could then put operations into two slightly different camps: those that need a scratch pad (reductions) and those that don’t (mappings).
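The distinction is easy to see in miniature. Here’s an illustrative Python sketch (not OokOok code) in which the reduction’s accumulator plays the role of the scratch pad:

```python
from functools import reduce

items = [1, 2, 3, 4]

# A mapping looks at one item at a time and keeps no memory between
# items, so the items can be processed in any order, or in parallel.
squares = [x * x for x in items]

# A reduction also sees one item at a time, but carries a "scratch pad"
# (the accumulator) forward between items.
def add(scratch_pad, item):
    return scratch_pad + item

total = reduce(add, items, 0)
```

The parallelism story follows from the scratch pad: mappings parallelize trivially, while reductions parallelize only when the scratch-pad operation can be split and recombined (as addition can).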
I’m making rapid progress in getting OokOok to a stable programming scheme. I haven’t made a lot of changes in its capabilities, though I did add the ability to archive themes and projects as BagIt files yesterday. Instead, I’ve been working on making the important stuff declarative. By hiding all the details behind a veneer of relationships, I can fiddle with how I manage those relationships without having to touch every relationship every time I make a change in the underlying schemas (and schemes).
For those used to an older style of Perl programming, this might come as a surprise. For those who have dealt with things like MooseX::Declare and CatalystX::Declare, you’ll be shaking your head at my foolhardiness in jumping into making an OokOok::Declare that hides the details of how to construct certain types of classes.
Behind the scenes, OokOok consists of controllers, models, views, REST collections/resources, SQL result objects, a template engine, and tag libraries for the templates. Almost two hundred classes in all.
If I built all of these the usual Perl way, there’d be a lot of boilerplate code around. By moving to a declarative approach, I can isolate all the boilerplate in a few core meta-classes. When the boilerplate has to change, I only have to touch one place. Everything else comes along for the ride.
For the rest of this post, I want to walk through how I use some of these declarative constructions. I won’t get into the scary details of how to make declarative constructions in Perl (at least, not in this post).
OokOok is coming along nicely. It’s been a couple of months since the last update, so I’ll outline a bit of what I’ve done since the last post. I’m nowhere near being able to throw up a demonstration server for anyone to play with, but I’m getting closer. With a little more testing, a reasonably decent administrative interface, some simple themes, and full authorization management, we’ll be good to go on a first demo. I’m aiming for the end of the year. I’m trying to think about what a good, simple demonstration project might be that is just text on-line. Perhaps a curated collection of creative-commons licensed works on a subject?
OokOok isn’t meant to do everything for everyone. I’m designing it with opinions. I think they are well researched and thought out opinions, but they are opinions. I hope the pros can outweigh the cons, but that’s something you’ll need to decide when considering which platform to use for your project.
I’m designing the system to enable citation, reproduction, sustainability, and description. You should be able to point someone at exactly the version of the page that you saw (citation), be able to see the same content each time you view that version of the page (reproduction), see that content “forever” (sustainability), and leverage computation through description (composing the rules) instead of prescription (composing the ways). I’ve based all the opinionated choices in the system on trying to meet the needs of those four “axioms.”
Last week, I talked about the basic model I’m considering for managing static web content in a way that lets us find it based on when we looked at it. The idea is that if I want to cite something, I should be able to point at what I’m citing and know that someone else following my citations will see the same thing I did.
Today, I want to explore what it means for something to be citable.
I come from the sciences, where citation is a shorthand for bringing in a body of work that you don’t want to reproduce in your text. It’s like linking in a library in a program. You’re asserting that something is important to your argument and anyone can find out why they should believe it by following the citation. You don’t have to explain the reasoning behind what you’re referencing.
If you use citations to give shout-outs to people in your field, then you don’t need what I’m thinking about. Readers understand that these citations are there to remind them of the other people and their body of work, not the particular passage pointed to in the citation. The details aren’t important enough to look up.
I’m interested in the citations that people need to follow.
I’ve made some good progress on the OokOok project over the last week. The system has a minimal page management interface now, so you can create, edit, and delete pages and place them in the project’s sitemap. You can create project editions that freeze the content in time, and you can see the different versions of a page using time-based URLs.
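One way to picture how a time-based URL resolves: each edition freezes a page’s content with a date, and a request for a date gets the version that was in effect then. Here’s a hypothetical sketch in Python (the dates, contents, and `version_as_of` name are all invented for illustration, not OokOok’s actual implementation):

```python
from bisect import bisect_right
from datetime import date

# Hypothetical version store: frozen versions of one page,
# sorted by the date of the edition that froze them.
versions = [
    (date(2012, 5, 1), "First draft of the page."),
    (date(2012, 6, 15), "Revised after feedback."),
    (date(2012, 8, 1), "Current published text."),
]

def version_as_of(versions, when):
    """Return the content live on `when`, or None if the page
    didn't exist yet at that date."""
    dates = [d for d, _ in versions]
    i = bisect_right(dates, when)  # how many versions predate `when`
    return versions[i - 1][1] if i else None
```

Since frozen versions are never edited, a URL carrying a date always resolves to the same content, which is exactly the citation and reproduction guarantee the earlier posts laid out.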
You know you have a real software project when you have a list of things that won’t be in the current version. So it is with OokOok. Eventually, I want to support any dynamic web-based digital humanities project and allow it to run forever without any project-specific maintenance. For now, I’ll be happy creating a simple text content management system that has all the time-oriented features. We can add support for algorithms later.
Today, I want to talk a bit about the model I’m using to keep track of the different versions and the impact this has on the user interface.