Streams, Part II

In the last post (back in 2012—this post has been in draft for some time), we talked a bit about streams. Eventually, I want to see how we might use streams to write a parser that can parse the language we’re developing to talk about streams, but let’s start with something a little simpler. The textual digital humanities revolve around text, and the really fun stuff involves textual manipulation if not parsing.

Today, let’s build a set of tools that will help us create a concordance of a text. We’ll have to make a lot of assumptions so that we can see the core pieces, so keep in mind that any real implementation will probably have different details.

We’ll assume for now that we have a stream of characters representing the text. We haven’t discussed where we get data or where we store it yet. That’s for another time. For now, we’re focused on what we do with the data between getting and storing it. If we can wrap our minds around what to do with the data, then we can plug any data retrieval or storage we want into our processing later.

Given a stream of characters, we want to break it into a stream of words. Essentially, we read in characters until we find a whitespace character of some kind (a space, a newline, a tab, etc.) and pass along what we’ve read, promising to find the next word when we need it. Something like the following:
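The stream language we’re developing is still hypothetical, so here’s a sketch of the idea in Python, where a generator plays the role of a lazy stream. The name `words` and the whitespace-splitting rule are assumptions for illustration.

```python
def words(chars):
    """Turn a stream (iterator) of characters into a stream of words.

    Characters are buffered until a whitespace character arrives; the
    buffered word is then yielded and the buffer reset. The stream is
    consumed lazily: the next word isn't found until someone asks.
    """
    buffer = []
    for ch in chars:
        if ch.isspace():
            if buffer:                 # end of a word: emit it
                yield "".join(buffer)
                buffer = []
        else:
            buffer.append(ch)
    if buffer:                         # flush the final word, if any
        yield "".join(buffer)
```

For example, `list(words(iter("hello  world\nfoo")))` yields the words `hello`, `world`, and `foo`, with the runs of whitespace discarded.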

Now, we can give words a stream of characters and get out a stream of words.

Of course, for a concordance, we want to know how many times the words appear, and where they appear. There’s metadata in them there words, so to speak, and we need to keep track of it.

One option would be to build a structure for each word, but this focuses our attention on the structure and away from the words. If we’re working on words, we want to stay focused on them while still allowing us to associate metadata with those words.

One way to do this is to annotate the stream of characters with the information we’re interested in. Perhaps we want to know page and line numbers, with pages delimited by form feeds and lines delimited by newlines (line feeds on Unix). We might build the following that takes a stream of characters and annotates it.
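Continuing the Python sketch, one way to annotate the stream is to wrap each character in a tuple carrying its page and line number. The `(char, page, line)` representation and the name `annotate` are assumptions here, not the post’s own code.

```python
def annotate(chars):
    """Annotate a character stream with page and line numbers.

    Pages are delimited by form feeds (\f) and lines by newlines (\n).
    Each character is tagged with the page and line it appears on; the
    delimiter itself belongs to the page/line it ends.
    """
    page, line = 1, 1
    for ch in chars:
        yield (ch, page, line)
        if ch == "\f":          # form feed: next page, restart line count
            page += 1
            line = 1
        elif ch == "\n":        # newline: next line on the same page
            line += 1
```

So `list(annotate(iter("a\nb")))` gives `[("a", 1, 1), ("\n", 1, 1), ("b", 1, 2)]`: the stream is still character by character, but every character now knows where it lives.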

Our stream is still the stream of characters, but now the characters have the line and page annotating them. If we keep metadata when we combine characters into strings (such as we do when adding them to the end of the buffer when turning the stream of characters into a stream of words), then our words will end up with the page and line numbers they appear on.
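In the Python sketch, keeping metadata through the combining step might look like the following: a variant of the word-builder that consumes annotated `(char, page, line)` tuples and tags each word with the location of its first character. The choice of the first character’s location is an assumption; a real system might carry a richer span.

```python
def annotated_words(annotated_chars):
    """Build words from a stream of (char, page, line) tuples.

    Each word is yielded as (word, (page, line)), where the location
    is that of the word's first character.
    """
    buffer, loc = [], None
    for ch, page, line in annotated_chars:
        if ch.isspace():
            if buffer:
                yield ("".join(buffer), loc)
                buffer, loc = [], None
        else:
            if not buffer:
                loc = (page, line)     # remember where the word starts
            buffer.append(ch)
    if buffer:
        yield ("".join(buffer), loc)
```

Fed the annotated stream for `"hi\nyo"`, this yields `("hi", (1, 1))` and `("yo", (1, 2))`: the page and line metadata survives the trip from characters to words.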

The last thing we need to compile the core data with which we can build a concordance is to collect, for each word, the page and line number metadata. We can do this by partitioning the stream of words so that we get a different stream of metadata for each word. There are several ways this could be done, and in any real system, the ability to partition a stream would likely need to be a fundamental operation for efficiency. For now, we’ll specify the pattern we’re looking for and let the hypothetical system figure out how to make everything work.
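In the Python sketch, the partition can be approximated eagerly with a dictionary keyed by word. In the hypothetical system each word would point at a lazy stream of locations; here we materialize those streams as lists, and the name `collect_locations` is an assumption.

```python
from collections import defaultdict

def collect_locations(located_words):
    """Partition a stream of (word, location) pairs by word.

    Returns a mapping from each word to the list of locations at
    which it appears -- the core data of a concordance.
    """
    index = defaultdict(list)
    for word, loc in located_words:
        index[word].append(loc)
    return dict(index)
```

Given the pairs `("the", (1, 1))`, `("cat", (1, 1))`, `("the", (1, 2))`, this produces `{"the": [(1, 1), (1, 2)], "cat": [(1, 1)]}`, collapsing repeated words while keeping every location.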

The result is a record with each word as a property pointing to a stream of location records. We’re assuming that the system will recognize that collectLocations(s...) should refer to the same computation in both locations. That and the pipe operator keep us from needing mutable variables.

One reason I’m resurrecting this post after letting it sit idle for almost a year is that the W3C recently formed a community group around RDF stream processing. I’ll post updates here once in a while as conversations progress. Stream processing is a fundamental computational model when working with web resources and thus critical to the semantic web as a replacement platform for traditional localized computing.


Parts of this series of posts on streams and OokOok are inspired by Mark Jason Dominus’s Higher-Order Perl. There are critical differences, so don’t expect a straight port from Perl to whatever we’re calling the stream-oriented language we’re developing for OokOok. Nor should you expect the implementation of OokOok to look anything like what’s in Higher-Order Perl. The book is a convenient and well-organized discussion of concepts that are fundamental to distributed, parallel computation as envisioned for OokOok.

Published by

James

James is a software developer and self-published author. He received his B.S. in Math and Physics and his M.A. in English from Texas A&M University. After spending almost two decades in academia, he now works in the Washington, DC, start-up world.