Last week, we explored the Poisson distribution as a possible distribution of sentence lengths. If you look at the figure for Hunter Crackdown, the Poisson seems reasonable, but it breaks down when looking at other works. In this post, I'd like to go back and try to derive a distribution that has the same qualitative features as the distributions we saw for each of the works. Then, I want to discuss a bit what we might want to do next.
Sentence Lengths for Twenty-One Words and Five Sentences
Suppose we have a number of sentences, labeled by index i, each containing a total of n_i words chosen from a dictionary of d words, with duplication allowed. That is, sentence i has n_i slots for words, and each slot is filled by selecting a word from the dictionary; the same word may be selected more than once in a sentence.
We can also apply a few more constraints on n_i. Each n_i must be greater than zero, since a sentence must have at least one word. Additionally, the sum of all of the sentence lengths must equal the number of words in the text. We'll worry later about varying the number of sentences in a work.
How many ways can we combine words to form sentences if we have n words in a work and s sentences? The visual we want to work with is easy if we represent words as circles (o) and sentence boundaries as crosses (×):
oo×oooo×ooo×ooooooo×ooooo
This says that we have five sentences with lengths 2, 4, 3, 7, and 5. That's 21 circles and 4 crosses. There is one fewer cross than there are sentences because the crosses mark the boundaries between sentences. We could put crosses at the beginning and end of the sequence (resulting in one more cross than sentences), but because those two would never move, we don't need to consider them.
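Now, we have n+s-1 positions into which we place n circles and s-1 crosses. If we place the crosses first, then we can fill in the rest of the positions with circles, so we just need to count how many different ways we can place the crosses. Because every sentence needs at least one word, no two crosses can be adjacent and no cross can sit at either end of the sequence.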
Note that regardless of how we place the crosses, the average number of words per sentence won't change. We're not changing the number of words or the number of sentences.
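As a quick check on the circles-and-crosses picture, here is a minimal Python sketch. The function names are mine and purely illustrative. It assumes every sentence needs at least one word, so a cross can only go into one of the n-1 gaps between adjacent circles. For comparison, it also computes the count when crosses may go anywhere among the n+s-1 positions, which would allow empty sentences.

```python
from math import comb

def layouts_allowing_empty(n_words, n_sentences):
    """Ways to place s-1 crosses among n+s-1 positions (empty sentences allowed)."""
    return comb(n_words + n_sentences - 1, n_sentences - 1)

def layouts_nonempty(n_words, n_sentences):
    """Ways to place s-1 crosses in the n-1 gaps between words (no empty sentences)."""
    return comb(n_words - 1, n_sentences - 1)

print(layouts_allowing_empty(21, 5))  # 12650
print(layouts_nonempty(21, 5))        # 4845
```

Either way, every layout has the same 21 words and 5 sentences, so the average of 4.2 words per sentence is unaffected.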
What we want to do now is see how likely a particular distribution of sentence lengths might be. This will be a bit trickier. In the following sections, we will assume that the fifth sentence has all of the remaining words. Feel free to skip down to the Discussion section if you don't want to see all of the ways we count words and sentences.
Four Sentences of One Length
Let's consider the case where all but one of the sentences are one word long. The remaining sentence will have all of the other words. In circles and crosses, this might look like the following:
o×o×o×o×ooooooooooooooooo
This is the same number of words and sentences as before, but now we have four sentences with one word and one sentence with seventeen words. How many ways can we order these sentences, assuming we only distinguish sentences by their sentence length?
We know we can't split the long sentence, so the only option is to choose where the long sentence sits among the one-word sentences. This changes the problem to the following: we have five slots (sentences) filled with four circles (the one-word sentences) and one cross (the seventeen-word sentence). This is just choosing one slot out of five, which can be done in five different ways. If we're keeping count of how many one-word sentences we get, then we now have twenty (four sentences in each of five arrangements). We also have five seventeen-word sentences.
The same approach gives five ways of arranging any four sentences of equal length plus one sentence holding the remainder of the words. Four sentences of two words each leave a fifth sentence with thirteen words. Four of three words leave nine, four of four words leave five, and four of five words leave one.
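The count of orderings used here can be sketched in Python. This is only an illustration, with a helper name (`arrangements`) of my own choosing. It assumes sentences are distinguished only by their length, so the number of orderings is 5! divided by the factorial of each length's repeat count.

```python
from collections import Counter
from math import factorial

def arrangements(lengths):
    """Distinct orderings of sentences that are distinguished only by length."""
    total = factorial(len(lengths))
    for repeats in Counter(lengths).values():
        total //= factorial(repeats)
    return total

# Four sentences of length a plus one sentence holding the remaining words.
for a in range(1, 6):
    pattern = [a] * 4 + [21 - 4 * a]
    print(pattern, arrangements(pattern))
```

Each of the five patterns should report five orderings, matching the count above.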
Three Sentences of One Length Plus One Sentence of Another Length
Now let's consider the case where we have three sentences of one word each, one sentence of two words, and one sentence with sixteen words. That's still twenty-one words and five sentences. If we place the two-word sentence first, we have five different choices. Once we place it, we have four choices left for the sixteen-word sentence. This gives us twenty different ways of arranging five sentences that consist of one two-word sentence, one sixteen-word sentence, and three one-word sentences.
Again, if we have three sentences with a particular length, another sentence with some other length, and a fifth sentence with the remaining words, we have the following combinations (e.g., 1, 2, and 16 indicate three sentences of length one, one sentence of length two, and one sentence of length sixteen):
Combinations such as (3,3,9) aren't allowed because that's four sentences with length three and one of length nine, but we've already counted that. Likewise, (2,6,9) and (2,9,6) are the same: three sentences of length two, one of length six, and one of length nine. We don't want to double count them. We've already accounted for swapping positions.
We also have three sets that aren't in the above table, in which three sentences share one length and the remaining words are split evenly between the other two sentences: (1,9), (3,6), and (5,3).
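The rules for which triples are allowed can be written down as a short Python sketch. The function names are hypothetical, and the code simply encodes the constraints described in the text: three sentences of length a, the two other lengths different from a and from each other, and each unordered pair listed once with the smaller length first.

```python
TOTAL_WORDS = 21

def three_of_a_kind_patterns(total=TOTAL_WORDS):
    """Yield (a, b, c): three sentences of length a, one of b, one of c, with b < c."""
    for a in range(1, total):
        rest = total - 3 * a
        for b in range(1, rest):
            c = rest - b
            if b < c and a not in (b, c):
                yield (a, b, c)

def three_plus_pair_patterns(total=TOTAL_WORDS):
    """Yield (a, b): three sentences of length a and two of length b, a != b."""
    for a in range(1, total):
        rest = total - 3 * a
        if rest > 0 and rest % 2 == 0 and rest // 2 != a:
            yield (a, rest // 2)

print(list(three_of_a_kind_patterns()))
print(list(three_plus_pair_patterns()))  # [(1, 9), (3, 6), (5, 3)]
```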
Now we can add some more counts to our table. For each of the rows in the above table, we have twenty sets of sentences, resulting in sixty one-word sentences, twenty two-word sentences, and twenty sixteen-word sentences for the first triple.
| Sentence length | Sentences (earlier cases + this case) |
|---|---|
| 1 | 585 = 25 + 560 |
| 2 | 480 = 20 + 460 |
| 3 | 420 = 20 + 400 |
| 4 | 280 = 20 + 260 |
| 5 | 205 = 25 + 180 |
| 9 | 65 = 5 + 60 |
| 13 | 25 = 5 + 20 |
If we count up everything, we get 11,025 words in 2,625 sentences, which is 4.2 words per sentence.
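The bookkeeping for a single row can be sketched in Python as follows. This is a minimal illustration with a function name (`tally`) of my own. It assumes each pattern contributes its number of distinct orderings times the number of sentences of each length it contains.

```python
from collections import Counter
from math import factorial

def tally(lengths):
    """Count sentences of each length over all distinct orderings of `lengths`."""
    multiplicity = Counter(lengths)
    orderings = factorial(len(lengths))
    for repeats in multiplicity.values():
        orderings //= factorial(repeats)
    return Counter({length: repeats * orderings
                    for length, repeats in multiplicity.items()})

counts = tally([1, 1, 1, 2, 16])
print(dict(counts))  # {1: 60, 2: 20, 16: 20}

# Words per sentence for this pattern alone.
words = sum(length * n for length, n in counts.items())
print(words / sum(counts.values()))  # 4.2
```

Summing the `Counter` objects for every pattern gives a running table, and the words-per-sentence check works at any stage because each ordering always contains 21 words in 5 sentences.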
Two Sentences of One Length and Two Sentences of Another Length
Now, we consider two sentences of length a and two sentences of length b. The fifth sentence must be of length 21-2a-2b. We must also have a different from b, and the fifth length different from both, or we would be recounting an earlier case. This gives us five-choose-two choices for placing the first pair and three-choose-one for placing the fifth sentence, for thirty different arrangements. I'll leave the generation of the table of possible sentence lengths as an exercise. The resulting counts of sentence lengths increase to the following:
| Sentence length | Sentences (earlier cases + this case) |
|---|---|
| 1 | 1095 = 585 + 510 |
| 2 | 900 = 480 + 420 |
| 3 | 900 = 420 + 480 |
| 4 | 580 = 280 + 300 |
| 5 | 535 = 205 + 330 |
| 6 | 400 = 160 + 240 |
| 7 | 350 = 80 + 270 |
| 8 | 200 = 80 + 120 |
| 9 | 125 = 65 + 60 |
| 11 | 120 = 60 + 60 |
| 13 | 55 = 25 + 30 |
| 15 | 50 = 20 + 30 |
Again, we can check our work and find that there are 22,995 words spread across 5,475 sentences for an average sentence length of 4.2.
Two Sentences of One Length, One Sentence Each of Two Other Lengths
Next, we consider two sentences of length a, one sentence of length b, and one sentence of length c. The fifth sentence must be of length 21-2a-b-c. We must also have a, b, c, and the fifth length all different. This gives us five-choose-two choices for placing the pair, three-choose-one for the third sentence, and two-choose-one for the fourth sentence, for sixty different arrangements. I'll leave the generation of the table of possible sentence lengths as another exercise. The resulting counts of sentence lengths increase to the following:
| Sentence length | Sentences (earlier cases + this case) |
|---|---|
| 1 | 3855 = 1095 + 2760 |
| 2 | 3120 = 900 + 2220 |
| 3 | 2820 = 900 + 1920 |
| 4 | 2020 = 580 + 1440 |
| 5 | 1735 = 535 + 1200 |
| 6 | 1360 = 400 + 960 |
| 7 | 950 = 350 + 600 |
| 8 | 740 = 200 + 540 |
| 9 | 605 = 125 + 480 |
| 10 | 480 = 60 + 420 |
| 11 | 360 = 120 + 240 |
| 12 | 280 = 40 + 240 |
| 13 | 175 = 55 + 120 |
| 14 | 100 = 40 + 60 |
Now we have 78,435 words in 18,675 sentences for 4.2 words per sentence on average.
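For the tables left as exercises, one possible approach is to generate every unordered set of five sentence lengths and group the sets by how often lengths repeat. The sketch below is only an illustration, and `partitions` and `signature` are names I've made up. A signature of (2, 2, 1) means two pairs plus a single, and (2, 1, 1, 1) means one pair plus three different lengths.

```python
from collections import Counter

def partitions(total, parts, largest=None):
    """Yield non-increasing tuples of `parts` positive integers summing to `total`."""
    if largest is None:
        largest = total
    if parts == 0:
        if total == 0:
            yield ()
        return
    # The first (largest) part must leave at least 1 word for each remaining part.
    for first in range(min(largest, total - parts + 1), 0, -1):
        for rest in partitions(total - first, parts - 1, first):
            yield (first,) + rest

def signature(lengths):
    """Repeat counts of the lengths, largest first, e.g. (2, 2, 1)."""
    return tuple(sorted(Counter(lengths).values(), reverse=True))

by_signature = Counter(signature(p) for p in partitions(21, 5))
for sig, n_patterns in sorted(by_signature.items(), reverse=True):
    print(sig, n_patterns)
```

Filtering `partitions(21, 5)` by a single signature lists the rows for that case, which can then be compared against a hand-built table.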
Four Sentences All of Different Lengths
Finally, we have the situation in which every single sentence has a different length. Each set of lengths can be arranged in one hundred twenty ways: the first sentence has five possibilities, the second has four possibilities, etc. The list is short:
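One way to generate such a list is the short Python sketch below. It is a brute-force illustration with a function name of my own. It assumes only what the text states: five different positive lengths that sum to twenty-one.

```python
from itertools import combinations

def distinct_length_sets(total=21, sentences=5):
    """All sets of `sentences` different positive lengths that sum to `total`."""
    return [lengths
            for lengths in combinations(range(1, total + 1), sentences)
            if sum(lengths) == total]

for lengths in distinct_length_sets():
    print(lengths)
```

Because `combinations` yields each set once in increasing order, no set is repeated under a different ordering.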
The resulting final counts for each sentence length:
| Sentence length | Sentences (earlier cases + this case) |
|---|---|
| 1 | 4575 = 3855 + 720 |
| 2 | 3720 = 3120 + 600 |
| 3 | 3660 = 2820 + 840 |
| 4 | 2500 = 2020 + 480 |
| 5 | 2095 = 1735 + 360 |
| 6 | 1600 = 1360 + 240 |
| 7 | 1310 = 950 + 360 |
| 8 | 980 = 740 + 240 |
| 9 | 725 = 605 + 120 |
| 10 | 600 = 480 + 120 |
| 11 | 480 = 360 + 120 |
We can plot the total number of sentences for each length as well as the number of words contained in sentences with each length. The plot of the number of words contained in sentences of particular lengths looks qualitatively like the distributions we saw last week when we counted the sentence lengths in a number of texts. The problem for now is that I'm not able to connect the two: one is counting the number of words assigned to sentences of particular lengths while the other is counting the number of sentences with a particular length.
I'm not too worried about the counts for three-, ten-, and eleven-word sentences. There are several "edge effects" that come about because of the short length of the text: with so few words, we don't have as much freedom for certain sentence lengths. As the length of the text increases, these effects should be less visible.
Discussion
The important thing in this analysis is that we could have replaced the terms "word" and "sentence" with "sentence" and "paragraph", respectively, and come to the same results (i.e., a text with twenty one sentences divided into five paragraphs). Likewise with "section" and "chapter" (a text with twenty one sections divided into five chapters), or "paragraph" and "chapter" (a text with twenty one paragraphs divided into five chapters).
One way to interpret this is that textual statistics may be invariant with respect to scale. Regardless of the scale at which we examine the text, we expect to see similar statistics: be it the word/sentence, the sentence/paragraph, or the paragraph/chapter.
This is a bit naïve because we may need to account for the information content. Not all words go together in a sentence of a given length. Longer sentences can carry more information, but sentences that are too long may not be readable. This tendency to a middle ground for sentence length combined with the counting we did today may result in a model that gives us something that looks like the results we were getting last week.
I had hoped that we would be a lot further along in this exploration by now, but counting can be difficult to get right. I'm still not confident that I haven't made any mistakes, but everything seems fairly consistent so far.
I am happy though that there is reason to hope that text statistics might be self-similar over a range of scales. This is important for some of the other analysis I want to do eventually.
There are a few questions that come out of the combinatorics that need to be addressed before I'm comfortable.
The combinations we ended up counting contain 22,875 sentences, but we should see 24,225. Essentially, take the words in the text and force the first word to be the first word of the first sentence. Then select four of the remaining twenty words to be the first words of the other sentences. The result is twenty choose four, or 4,845 sets of five sentences, for 24,225 sentences total. I may need to redo the analysis with this in mind. It may simplify things quite a bit. I tend to take the long route on problems like this anyway.
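The "first words" argument can be turned into a direct enumeration. The following Python sketch uses names of my own and assumes words are numbered from 0. It fixes word 0 as the start of the first sentence and chooses the other four sentence starts from the remaining twenty words.

```python
from collections import Counter
from itertools import combinations

def sentence_length_sets(n_words=21, n_sentences=5):
    """Yield each way to split the words into ordered, non-empty sentences."""
    # Word 0 always starts the first sentence; pick the other sentence starts.
    for starts in combinations(range(1, n_words), n_sentences - 1):
        bounds = (0,) + starts + (n_words,)
        yield tuple(b - a for a, b in zip(bounds, bounds[1:]))

splits = list(sentence_length_sets())
print(len(splits))                  # 4845
print(sum(len(s) for s in splits))  # 24225

# Number of sentences of each length across all splits.
by_length = Counter(length for s in splits for length in s)
```

The `by_length` tally offers a cross-check for the case-by-case tables, since it counts every split exactly once.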
The other question is how to generalize this to an arbitrary number of sentences for a text with an arbitrary number of words. I need the generalization if I want to test it against any texts.
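One possible generalization, offered as a sketch rather than a settled result, follows from the same "first words" argument. If every split of n words into s non-empty sentences is weighted equally, then a sentence of length k can sit in any of the s positions, and the remaining n-k words must be split into s-1 non-empty sentences. The function name below is hypothetical.

```python
from math import comb

def sentences_of_length(k, n_words, n_sentences):
    """Sentences of length k summed over every split of n_words into n_sentences.

    Sketch only: pick which of the sentences has length k, then split the
    remaining n_words - k words into n_sentences - 1 non-empty sentences.
    Assumes n_sentences >= 2.
    """
    if k < 1 or k > n_words - n_sentences + 1:
        return 0
    return n_sentences * comb(n_words - k - 1, n_sentences - 2)

n, s = 21, 5
counts = {k: sentences_of_length(k, n, s) for k in range(1, n - s + 2)}
print(sum(counts.values()))                   # 24225
print(sum(k * c for k, c in counts.items()))  # 101745
```

Multiplying each count by k gives the number of words held in sentences of that length, which is the second quantity plotted above.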