Narrative Statistics: Figuring Out a Distribution of Words in Sentences

Thursdays are my research days. I have a couple things cooking away that I'm not quite ready to write about yet, but I want to take a little time today to explore something that I plan on doing a lot more once my cooking is done.

I'm interested in studying narrative as a dynamic system. That is, there are several variables at play that determine the direction of a narrative. There are plot dynamics, character dynamics, and thematics that an author plays with to construct the story. They all interact in complex ways. A particular plot might require certain types of characters. A particular character might not fit certain types of plots. Some plots and characters don't illustrate certain themes well. The author has to select the right plots, characters, and themes (and write well) for the reader to enjoy the story.

Usually, when we want to study a dynamic system, we look at the equations that govern its motion, energy, or some other measurable quantity. But most of the interesting things that come out of the humanities don't have such equations. We can't make precise predictions about how a story will be written, or a piece of music will be composed, or what the stock market will do. That doesn't mean that they are random or non-deterministic. We know that there are patterns. When we listen to a piece of music or read a book, we develop expectations about what will happen next. We have innate predictive capabilities that provide our expectations.

When we don't know what the underlying determining system might be, we have a box of tools that we can use to find out about the system. We can develop models that can help us know what might happen next, or know how many independent variables should be in the model, or what kind of distribution of behaviors we might expect.

A classic study in this area is the 1984 paper by Robert Shaw, The Dripping Faucet as a Model Chaotic System, in which the entire dynamic of the dripping faucet is shown to be recoverable by measuring the time between drips.

How does this help us with narrative? If we know what human-written narratives look like, then we know what kinds of computational algorithms can produce narratives that look like they were written by people.

With the dripping faucet, for example, we know that there are three, and only three, independent variables that work together to govern the time between drips. So we know that whatever relationships determine the behavior of the drips, they must only use three independent variables. We can rule out any systems of equations that have four independent variables or two independent variables.

With this in mind, I'm embarking on a quest to gather statistics on fiction narrative. What characteristics should computer-generated text have if we want to trick people into thinking it was written by a person, even if it is poorly written or contains no new ideas?

The first, easiest, and probably most obvious thing to do is look at the distribution of sentence lengths. We don't expect sentences to have fewer than one word in them or more words than there are in the text. We don't expect a uniform distribution of lengths because we tend to use different sentence lengths for different effects. Short, punchy sentences for quick action. Long, languid, meandering sentences for thoughts that need some time to percolate.

An example of a Poisson distribution.

This sounds a lot like a Poisson distribution. The only parameter needed to draw the distribution is the mean, µ. The equation is:

$$f(k, \mu) = \frac{\mu^{k} e^{-\mu}}{k!}$$

This equation is for a probability distribution, not the actual number of sentences in the text with the given number of words, so we will need to scale it so the maximum matches up with whatever maximum we find for a text.
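A minimal Python sketch of that scaling, for illustration only. Python is a choice made here, and the function names `poisson_pmf` and `scaled_poisson`, the mean of 16.4, and the peak of 300 sentences are all hypothetical values rather than measurements from any text.

```python
import math


def poisson_pmf(k, mu):
    """Poisson probability of exactly k events when the mean is mu."""
    # Work in log space so a large k does not overflow the factorial.
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def scaled_poisson(mu, peak_count, max_k):
    """Rescale the distribution so its largest value equals peak_count."""
    pmf = [poisson_pmf(k, mu) for k in range(max_k + 1)]
    factor = peak_count / max(pmf)
    return [factor * p for p in pmf]


# Hypothetical inputs: mean of 16.4 words, tallest histogram bar of 300 sentences.
curve = scaled_poisson(mu=16.4, peak_count=300, max_k=60)
print(curve.index(max(curve)), max(curve))
```

The scaled curve can then be drawn over the raw histogram of sentence counts, since both now share the same peak height.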

The next step is to start looking at sentence lengths and see if we get anything that looks like this.

Hacker Crackdown

The first text we can look at is Hacker Crackdown, by Bruce Sterling. The text is available from Project Gutenberg. It has about 101,716 words and 6,201 sentences, averaging 16.40316 words per sentence. The maximum sentence length is 213 words. These numbers are approximate because we use Perl to discard the header and license sections from the Project Gutenberg file and then run it through the Lingua::EN::Sentence Perl module to extract the sentences.
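The same kind of pipeline can be sketched in Python. This is not the Perl code used for the numbers above: the regex splitter below is a naive stand-in for Lingua::EN::Sentence and will miscount around abbreviations, and the `*** START OF` / `*** END OF` markers are an assumption about how the Project Gutenberg file is laid out. All function names are hypothetical.

```python
import re
from collections import Counter

# Assumed markers: many Project Gutenberg files wrap the book text in lines
# that begin with "*** START OF" and "*** END OF".
START_RE = re.compile(r"^\*\*\* ?START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF.*$", re.MULTILINE)


def strip_gutenberg(raw):
    """Keep only the text between the start and end markers, if present."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish]


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_length_counts(text):
    """Map sentence length in words to the number of sentences of that length."""
    return Counter(len(s.split()) for s in split_sentences(text))


sample = "Short one. This sentence has five words. Run! This one also has five."
counts = sentence_length_counts(sample)
print(sorted(counts.items()))
```

Because the splitter is so simple, word and sentence totals from a sketch like this should be expected to differ somewhat from the figures quoted above.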

Distribution of sentence lengths for Hacker Crackdown

The resulting plot does resemble a Poisson distribution, but it's a bit tighter than what we might expect. We obviously need to scale the k parameter (the horizontal axis). The problem we run into is that k is supposed to be an integer. If we scale it, we lose that fundamental requirement for the interpretation of the Poisson distribution.

This gives us a distribution with two parameters and a factor to transform the numbers from a probability to the number of sentences with those words:

$$f(k, \mu, A, B) = A \, \frac{\mu^{Bk} e^{-\mu}}{(Bk)!}$$

I don't know what A or B should be for the text. I need to spend a bit more time with Mathematica to figure that out. But the plot is qualitatively in the realm of the Poisson distribution. I also need to look at other distributions that share some of the same characteristics.
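One way to sidestep the integer requirement is to replace the factorial with the gamma function, since Γ(n + 1) = n! for non-negative integers n. The Python sketch below assumes the scaled form is A times µ^(Bk) e^(-µ) / (Bk)!, with A as the count scale and B as the horizontal scale. The function name and the sample values are hypothetical, and nothing here says what A or B should be for a real text.

```python
import math


def scaled_poisson_curve(k, mu, a, b):
    """Evaluate a * mu**(b*k) * exp(-mu) / Gamma(b*k + 1).

    Gamma(x + 1) equals x! at non-negative integers, so a = 1 and b = 1
    recover the ordinary Poisson probability.
    """
    x = b * k
    return a * math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))


# With a = b = 1 this should agree with the plain Poisson formula.
plain = 16.4 ** 10 * math.exp(-16.4) / math.factorial(10)
print(math.isclose(scaled_poisson_curve(10, 16.4, 1.0, 1.0), plain))

# A non-integer b * k is now well defined (values chosen arbitrarily).
print(scaled_poisson_curve(10, 16.4, 300.0, 0.75))
```

With a function like this in hand, A and B could be searched for numerically, though once k is stretched the curve is no longer a Poisson probability in the strict sense.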

More Texts

Let's look at some more texts just in case Bruce Sterling is a fluke. The following graphs show the distributions of sentence lengths.

Distribution of sentence lengths for Catharine Furze
Distribution of sentence lengths for The Son of Tarzan
Distribution of sentence lengths for A Discourse on Method
Distribution of sentence lengths for Our Friend John Burroughs

That's only four more texts, and the Descartes text looks odd, but this looks like a viable route to go down for a bit. The Descartes text is odd only because the maximum number of sentences at any particular length is much lower than in the other texts. Based on the following table, it seems that the problem with Descartes may be the low number of words relative to the other texts.

| Text | # Words | # Sentences | Words per sentence |
|---|---|---|---|
| Catharine Furze (6023) | 67,044 | 3,672 | 18.258 |
| Hacker Crackdown (101) | 101,716 | 6,201 | 16.403 |
| A Discourse on Method (59) | 23,050 | 293 | 78.669 |
| Our Friend John Burroughs (6561) | 65,572 | 2,739 | 23.940 |
| The Son of Tarzan (90) | 94,531 | 5,426 | 17.422 |
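The last column is just words divided by sentences. As a quick check, this Python snippet recomputes it from the word and sentence counts in the table; the variable names are arbitrary.

```python
# (words, sentences) copied from the table above.
texts = {
    "Catharine Furze": (67044, 3672),
    "Hacker Crackdown": (101716, 6201),
    "A Discourse on Method": (23050, 293),
    "Our Friend John Burroughs": (65572, 2739),
    "The Son of Tarzan": (94531, 5426),
}

# Mean sentence length for each text, rounded to three decimals.
words_per_sentence = {
    title: round(words / sentences, 3)
    for title, (words, sentences) in texts.items()
}

for title, mean in words_per_sentence.items():
    print(f"{title}: {mean:.3f}")
```

The recomputed means make the Descartes outlier easy to see: its average sentence is several times longer than in the other four texts, and it has far fewer sentences to fill out a histogram.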

Next Steps

Besides looking at other distributions and calculating distributions for more texts, we may want to explore some other tools at our disposal, like dimensionality analysis. In a future post, I'll discuss what we might do to figure out how many independent things are going on to construct a text.

What are some of the tools you've used to deconstruct text and see the underlying structures?

Update: Thanks to Travis Brown for pointing out that I could just as well normalize the histogram from the texts. I'll be doing such normalization going forward when we're dealing with anything resembling a probability.
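A rough Python sketch of what that normalization might look like: divide each bar of the histogram by the total number of sentences so the bars sum to 1, then compare them directly with the Poisson probabilities. The counts below are toy values, not taken from any of the texts, and using the sample mean as µ is one natural choice rather than a fitted result.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn raw sentence counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}


def mean_length(freqs):
    """Mean sentence length implied by the normalized histogram."""
    return sum(length * p for length, p in freqs.items())


def poisson_pmf(k, mu):
    """Poisson probability of exactly k when the mean is mu."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


# Toy counts (length in words -> number of sentences), made up for illustration.
counts = Counter({3: 2, 4: 5, 5: 8, 6: 4, 7: 1})
freqs = normalize(counts)
mu = mean_length(freqs)

# Print each length with its observed frequency and the Poisson probability.
for length in sorted(freqs):
    print(length, round(freqs[length], 3), round(poisson_pmf(length, mu), 3))
```

Normalizing the data this way removes the need for a separate factor that converts probabilities into sentence counts.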
