English: Stream
English: Stream (Photo credit: Wikipedia)

While I am trying to round out the content management aspects of OokOok this year, I'm starting to think ahead to next year's work on databases and processing. Part of the goal is to offer a platform that lets you take advantage of parallel processing without requiring that you be aware that you're doing so. Of course, any such platform will be less powerful than hand-coding parallel code in C or using your favorite Hadoop library. Less powerful is better than not available. I want OokOok to make available capabilities that would otherwise be hidden away.

Map/reduce seem like the simplest way  to think about parallel processing. We have two kinds of operations: those that look at one item at a time (mappings), or those that have to see everything before they can finish their calculation (reductions). Reductions can get by seeing one item at a time if they can keep notes on a scratch pad. We could put operations then into two slightly different camps: those that need a scratch pad (reductions) and those that don't (mappings).

That's as far as we'll go with the scratch pad side of things. I want to explore what we can do with the other side of these operation definitions: the items. In both cases, we said that operations can look at items one at a time as long as the operation had a scratch pad if it needed one. This leads to the idea that operations can use streams of items. Like a rock in a river, the operation sits in one place and observes the items coming by, replacing them with new items if needed.

A stream might snake around several operations to produce a result. If the operations live on different systems, then we have the makings of distributed processing since we've distributed the processing across different systems (and necessarily distributed the streams to connect the operations).

How do we construct such a stream of items? The simplest way is to say that we have a first value and the rest of the stream, and that we'll worry about the rest of the stream when we have to. We're lazy, but not too lazy. Just lazy enough to make everything work without too much extra work.

We can borrow from Perl (and some other languages) and say that a stream with a first element and the rest of the stream looks something like this when written out:

The "..." stands in for the rest of the stream, whatever that might be.

We need some way to use these values. Let's use a prime (') for the first item in the stream and the ellipses (...) for the rest of the stream.

I've used ":=" to show assignment and "=" to show equality.

There are a few things we can do with the stream now, such as altering it a bit.

I've introduced a few new things here as I defined the "set_first" and "set_rest" functions.

For "set_first", we're saying that it takes two values, a single stream, which we call "s" (the trailing underscore indicates that it is singular), and any single value, which we call "sh". It produces a new stream consisting of the "sh" value and the rest of the "s" stream. Think of ":>" as saying that if we find someone who looks like the left side, we can replace it with the right side, but not the other way around.

The way to read "set_rest" is the same, but we're taking the first value of the "s" stream and putting it ahead of the "st" value, which could be a stream or anything else, to produce a new stream.

The thing to notice here is that we haven't altered the original stream. We've constructed a new stream. One of the fundamental aspects of a distributed system is that it needs to share as little as possible. If we allow an operation to change a stream, then we have to send changes back to other systems. Instead, we simply send along a different stream than the one we received. Or, for functions, we return a different stream than the one we were given.

What if we wanted to create a new stream with a different value between the first value and the rest of the stream?

This hand-wavy notion of the rest of the stream is handy when we want to define something without having to compute it completely. For example, if we wanted a stream of integers starting with a particular number, we might say it looks like the following:

That is, the stream of integers starting with "n" is just "n" followed by the stream of integers starting with "n+1". We don't want to have to calculate that stream just to create it. We want to get each number from the stream as we need it. So we can say that the rest of the stream (the "upfrom(n+1)") is a promise to give us the rest when we need it.

We can write "upfrom(n)" as "n.." to make it easier to read. We can also define a function "upto(m,n)" with the shorthand, "m..n", to show a finite stream of integers running from "m" to "n". Both are useful and common in programming.

This concept of a promise is powerful. It lets us specify what we want the computer to do, but delay doing it until a later time. It's natural to say that when we define a stream, that the last thing in the stream is a promise if it's not a constant value. So [1,2,3] wouldn't contain a promise because it's completely defined by constants, but [1, 2, upfrom(3)] would have a promise at the end. What do we do about [1, upfrom(3), 4] ? For now, let's consider that invalid since we would calculate upfrom(3), find a promise, and never get to 4.

I mentioned mappings as one of the operations on a stream. We've done some stream re-creation, but not anything that looked at every item in the stream and did something with it. Let's see what a transformation might look like.

That says that we'll create a new stream by producing twice the first value of "s" with a promise to double the rest of "s".

Let's look at another transformation.

It looks just like our "double" but with a 3 instead of a 2. We could define a new function that handles this difference:

Is there a way to make this more general? Perhaps if we could give it a promise that we'll calculate the value if it will just look at each item in the stream:

Since the promise isn't indicating the rest of a stream, we need some way to make it stand out as something not to be evaluated yet. We also need a way to talk about the item we're looking at. Perhaps something like this:

We've borrowed a bit of notation from XPath. The dot (.) indicates the current item we're examining. The curly braces ( {} ) often show that we want to compute the value instead of providing a literal, constant value.

What if we wanted all the odd multiples of 3? We can find the multiples of three simply by composing our "triple" function with our "upfrom" function:

This will give us an endless stream of triples.

To find all the odd ones, we can filter the stream, allowing only odd numbers into the new stream:

We've introduced some new notation here with the colons. What we're saying is that if the condition on the left of the colon is true, then we want the expression on the right side. As soon as we find a left condition that is true, we take the right expression and we're done.

So for the "odds" filter, we look to see if the first value in the stream is even (no remainder when divided by two). If so, we give back the list of odds from the rest of the stream. If the first value is not even (it's odd), then we build a new stream consisting of this odd value and then the odds from the rest of the stream.

The result is that we go through the stream until we find an odd value. At that point, we produce a new stream with the odd value as the first value and a promise that we'll look through the rest of the stream for odd values when we need to.

As before, we can generalize this:

Let's see what we have in our toolbox now (along with the Lisp equivalents):

  • Stream constructor (CONS): [ ]
  • First value of a stream (CAR): '
  • Rest of a stream (CDR): ...
  • Construct a promise (Quote): { }
  • Conditional branching (COND): (condition): (expression)
  • Equality testing (EQ): =
  • Function definition (defun): :>

We've just constructed our own Lisp dialect. It's based on the idea of streams and distributed operations on those streams, but Lisp is Lisp. What we can do with [], ', ..., {}, and : has already been done in Lisp, so we're not inventing anything new. This is a good thing: it shows that we're probably on the right track. Building it up from a stream base helps me understand how I might work towards an implementation.

I wouldn't expect anyone to use this directly, but it gives us a way to express what we want to do in a way that can work with distributed processing. There are opportunities for syntactic sugar and computer mediation to translate a problem description into a solution in terms of these operations.

This post is long enough for now. I'll dive into what we might do with this later, but streams might be a powerful way to bring parallel, distributed computing to the humanities in a way that doesn't need deep knowledge of such computing to leverage it.

One thought on “CAR, CDR, CONS, Oh My!”

Comments are closed.