Some Thoughts on Designing a Linked Open Code Schema

As I start thinking about what should go into the core of a linked open code schema, I'm tempted to put a lot of high-level operations into the core so they run faster. History tells us that's the wrong way to go.

CISC v. RISC

CPU instruction sets were dominated by two camps: CISC (complex instruction set computing) and RISC (reduced instruction set computing). CISC CPUs had complex instructions for manipulating strings, copying blocks of memory, and managing return and data stacks. CISC chips tended to have CPU registers dedicated to certain tasks (e.g., a stack pointer).

RISC CPUs, on the other hand, had the simplest instruction set necessary to operate a computer (simple not in the number of instructions but in the amount of responsibility any one instruction carried), but often a large number of registers resident in the CPU that could be used for any purpose: as stack pointers, as accumulators, or for something the engineers hadn't thought of.

In the end, RISC CPUs could get more done with simpler instructions because each of the small parcels of work could be interleaved with other small parcels of work by a compiler. CISC CPU instructions were all-or-nothing and couldn't be decomposed. For example, the work of a CISC instruction to copy a string (perhaps something like CSTRZ AB, CD, assuming the string was null-terminated) would be written as a loop on a RISC system:
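For instance, the loop might look something like this (the registers and mnemonics here are invented purely to illustrate the shape of the loop):

    copy:
        LOAD  R3, (R1)      ; read one byte from the source address in R1
        STORE R3, (R2)      ; write it to the destination address in R2
        ADD   R1, R1, 1     ; advance the source pointer
        ADD   R2, R2, 1     ; advance the destination pointer
        BNZ   R3, copy      ; loop until the null terminator has been copied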

That's a simple example using a fictitious assembly language.

A Look at Folding

A linked code schema is comparable to a machine language. When building out a set of linked data types to represent code, it's tempting to put in a lot of types that represent optimizations or complex operations, even when these can be written as compositions of more basic operations. The lesson from CPUs is to avoid this.

So instead of having an atomic operation to fold a stream, we can write a function in terms of definedness, lambda application, comparison, and stream manipulation:
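Here's a rough sketch of the intended behavior. It's written in Haskell with lazy lists standing in for streams, since I'm not reproducing Dallycot's own syntax here; the name foldlStream and the details are illustrative, not Dallycot's actual definition:

    -- foldlStream produces the stream of partial accumulations: the lambda
    -- applied to the initial value and the first element, then to that
    -- result and the second element, and so on.
    foldlStream :: b -> (b -> a -> b) -> [a] -> [b]
    foldlStream _       _ []       = []
    foldlStream initial f (x : xs) = acc : foldlStream acc f xs
      where acc = f initial x

It behaves like the standard scanl, except that the initial value itself isn't included in the output.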

Now, when I write foldl( 0, { #1 + #2 }/2, 1..50), I get a stream of the running sums from 1 up to whichever number I'm on: [1, 3, 6, 10, ..., 1128, 1176, 1225, 1275].

I could create a function that gives me the last element in a list:
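Again sketched in Haskell rather than Dallycot's own syntax (the Prelude already provides this as last):

    -- lastOf walks a stream and returns its final element; it never
    -- returns if the stream doesn't terminate.
    lastOf :: [a] -> a
    lastOf [x]      = x
    lastOf (_ : xs) = lastOf xs
    lastOf []       = error "lastOf: empty stream"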

Of course, this will run forever if the stream never terminates, but combined with foldl, I can calculate the sum of a stream of numbers:
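Continuing the Haskell sketch, using the foldlStream and lastOf definitions above (the leading zero covers the empty stream):

    -- sumOf reduces a finite stream to a single number by folding and
    -- then taking the last partial accumulation.
    sumOf :: Num a => [a] -> a
    sumOf xs = lastOf (0 : foldlStream 0 (+) xs)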

This is also a more natural way to reduce a stream: just pull elements off the stream produced by folding until you reach the end or have as much precision as you need (e.g., when the fold implements an iterative approximation such as Newton's method).

Just because I don't want to implement something as a core operation doesn't mean I don't want some syntactic sugar in Dallycot. For folding, we have:
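Judging from the ... << ... << ... pattern mentioned below, the sugared form of the earlier example would presumably read something like 0 << { #1 + #2 }/2 << 1..50, with the stream on the right, the lambda in the middle, and the initial value on the left.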

It's like a mapping, but it carries some information from one item to the next as it moves down the stream. It's like we're shoveling the stream into the initial value by way of the intermediate lambda.

For streams, there can't be a right fold because we'd have to find the end and work backwards, and streams are considered semi-infinite. For now, we're only looking at what's possible with semi-infinite streams; finite streams, such as vectors, would let us work with the entire data set.

The parser/compiler still transforms the ... << ... << ... syntax into an application of a function like the above foldl.

Isn't this a lot of code to be passing around just to do a fold? Not really. In the world of linked code, we can stash that function at a URL and reference it:
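As a purely illustrative sketch (the vocabulary and URLs below are invented for this example, not an actual Dallycot namespace), a reference with its bindings might look like:

    {
      "@context": { "loc": "http://example.com/linked-open-code#" },
      "@type": "loc:Application",
      "loc:function": { "@id": "http://example.com/streams#foldl" },
      "loc:bindings": [
        { "@value": 0 },
        { "@id": "http://example.com/arithmetic#sum-of-two" },
        { "@id": "http://example.com/examples#one-to-fifty" }
      ]
    }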

Dereferencing the URL should give us a lambda that we can evaluate with the given bindings.

More likely, we'd define the function at the URL with something like the following:
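Again with an invented vocabulary, purely to give the flavor, the definition living at that URL might look something like this, with the lambda's body itself linked as (or expanded into) more structured data:

    {
      "@context": { "loc": "http://example.com/linked-open-code#" },
      "@id": "http://example.com/streams#foldl",
      "@type": "loc:Lambda",
      "loc:parameters": [ "initial", "function", "stream" ],
      "loc:body": { "@id": "http://example.com/streams#foldl-body" }
    }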

That looks like the beginnings of a JSON-LD linked code library to me. It also looks like reduction and folding don't need to be axiomatic operations.

Bonus Material

Here are some more functions.

min and max produce streams in which each element is the minimum or maximum value seen so far in the input.
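In the Haskell sketch (scanl1 seeds the fold with the first element of the stream, since there's no separate initial value):

    -- Each element of the result is the smallest (or largest) value
    -- seen so far in the input stream.
    minSoFar, maxSoFar :: Ord a => [a] -> [a]
    minSoFar = scanl1 min
    maxSoFar = scanl1 max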

sum and product are trivial applications of foldl. They both return a stream of accumulators representing the sum or product up to that point.
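Sketched with the foldlStream definition from earlier:

    -- Each element of the result is the sum (or product) of the input
    -- elements up to that point.
    sums, products :: Num a => [a] -> [a]
    sums     = foldlStream 0 (+)
    products = foldlStream 1 (*)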

weighted-count-and-sum expects a stream of duples (<weight, value>) and produces a stream of duples holding the summed weights and the summed weight/value products for all elements seen so far.
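A possible Haskell rendering, treating duples as ordinary pairs:

    -- Expects a stream of (weight, value) duples and produces a stream of
    -- (summed weights, summed weight * value products) duples for the
    -- elements seen so far.
    weightedCountAndSum :: Num a => [(a, a)] -> [(a, a)]
    weightedCountAndSum =
      foldlStream (0, 0) (\(w, s) (wi, vi) -> (w + wi, s + wi * vi))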

repeat produces a repeating stream of a single value. There are more efficient ways to build this in real linked data (e.g., an RDF list with the rdf:rest property pointing to itself).
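In the Haskell sketch (the Prelude's repeat does exactly this):

    -- A stream that repeats a single value forever.
    repeatValue :: a -> [a]
    repeatValue x = x : repeatValue x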

We can use the make-ones function to produce a repeating stream of ones that gets garbage collected properly. Of course, if we built the repeating list by hand in RDF, this wouldn't be an issue.

count-and-sum is simply weighted-count-and-sum with each item given the same weight, which is what we need for an average. We use the stream of ones to make this happen, zipping (the Z operator) the ones with the numeric stream. The mean is then just a mapping from the resulting stream of duples to a stream of numerics.
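Putting the pieces of the sketch together, with zip standing in for the Z operator and repeatValue 1 standing in for make-ones:

    -- A repeating stream of ones, used to give every element the same weight.
    ones :: Num a => [a]
    ones = repeatValue 1

    -- Pair every element with a weight of one, so the running totals are
    -- (count so far, sum so far).
    countAndSum :: Num a => [a] -> [(a, a)]
    countAndSum xs = weightedCountAndSum (zip ones xs)

    -- Map each (count, sum) duple to the running average.
    mean :: Fractional a => [a] -> [a]
    mean xs = map (\(n, s) -> s / n) (countAndSum xs)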