In the Narrative Statistics series of posts, I'm exploring different ways to characterize fiction using statistics. I'm recovering from a flu or cold as well as a nasty cough that followed, so instead of delving into deep math, I want to review what I see as the role of statistics, at least for this series. Many people consider statistics to be magical formulae that give questionable answers. In the humanities, there seems to be a lot of mistrust for statistics because people don't understand them.
I've been in the audience when someone presented statistical results and someone else commented that, because the outliers obviously didn't agree with what they already believed to be true, the outliers must be mistakes and thus the statistical method must be suspect. They then turned around and asked what statistics can provide other than reinforcing what they already know. They first threw out any new information and then asked what new information the methods could provide. The profound lack of logic mystifies me.
Ignoring evidence contrary to what we believe is human nature. Science tries to break us out of the habit. When anyone dismisses science and the scientific method,[1] they are opening themselves up to missing the truth if it doesn't agree with what they believe to be true. We should always be suspicious of what we believe to be true, especially when we don't have evidence to support what we believe.
At the heart of any statistic is a story. It might not be cause and effect, but it does describe a relationship of some kind. Certain stories go with certain statistical calculations. If we want to know if a story about how something works might be valid, we look at the statistics and see if they match what we see in the real world.
For example, if I think that sentences are of random length with no other considerations, then I expect each length of sentence to be represented equally in a story. There would be as many one-word sentences as two-word sentences as three-word sentences, etc. Look at almost any story and you will see that this is not true. Thus, sentence lengths are not random with every length equally likely. The story "things with random value Y, equally distributed" implies a statistic that has the same probability for any Y, regardless of what that Y might be. If the statistical expectation doesn't match what we measure, then the story cannot be in play as an explanation for what we're measuring.
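As a rough illustration of testing the "every length equally likely" story, here is a minimal Python sketch. The sentence splitter, the choice to treat every length from 1 up to the longest observed sentence as equally likely, and the sample text are all assumptions made for the example, not something taken from a real corpus.

```python
import re
from collections import Counter

def sentence_lengths(text):
    """Split on ., !, ? and return the word count of each non-empty sentence."""
    sentences = re.split(r"[.!?]+", text)
    return [len(s.split()) for s in sentences if s.split()]

def chi_square_vs_uniform(lengths):
    """Chi-square statistic against equal expected counts for lengths 1..max.

    Assumption: the 'equally likely' story covers every length from 1 up to
    the longest sentence observed; the story itself does not fix a range.
    """
    counts = Counter(lengths)
    max_len = max(lengths)
    expected = len(lengths) / max_len
    return sum(
        (counts.get(k, 0) - expected) ** 2 / expected
        for k in range(1, max_len + 1)
    )

# Made-up sample text, only to show the mechanics.
sample = "It rained. We stayed in. Nobody called. The dog slept by the door all afternoon."
lengths = sentence_lengths(sample)
stat = chi_square_vs_uniform(lengths)
print(lengths, stat)
```

A large statistic relative to the number of length bins suggests the flat story does not fit. Four sentences are far too few to conclude anything; a real test would use a full text.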
Another story might be that a text has a certain number of words that are randomly divided into sentences. This is boson statistics. I've explored some of this in past posts and so far, the numbers don't seem to match what I see from actual texts. Note that you might already object to this story since few, if any, authors select a number of words, throw in sentence divisions, and then write their text. Remember, it doesn't matter if the story agrees with what we believe to be true. We have to test it. If the tests pass, we have to consider it as a candidate.
In both of these situations, we might calculate the average sentence length and get the same results. It takes other calculations to differentiate between the two. The key is that these other calculations are mathematical descriptions of the stories, not calculations of the results of the stories.
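To make the point about averages concrete, here is a small simulation sketch of both stories tuned to the same average sentence length. The random-cut procedure is only one possible reading of "randomly divided into sentences," and the sizes (2000 sentences, average of 10 words) are arbitrary values chosen for the example.

```python
import random
import statistics

def uniform_story(num_sentences, mean_len, rng):
    """Every length from 1 to 2*mean_len - 1 is equally likely (mean is mean_len)."""
    return [rng.randint(1, 2 * mean_len - 1) for _ in range(num_sentences)]

def random_division_story(num_sentences, mean_len, rng):
    """Cut a run of words at randomly chosen gaps; each sentence gets >= 1 word.

    Assumption: every way of placing the cuts is equally likely.
    """
    total_words = num_sentences * mean_len
    cuts = sorted(rng.sample(range(1, total_words), num_sentences - 1))
    edges = [0] + cuts + [total_words]
    return [right - left for left, right in zip(edges, edges[1:])]

rng = random.Random(0)  # fixed seed so the run is repeatable
uniform = uniform_story(2000, 10, rng)
divided = random_division_story(2000, 10, rng)

for name, lengths in (("uniform", uniform), ("random division", divided)):
    print(name, statistics.mean(lengths), statistics.pvariance(lengths))
```

The two averages should come out close to each other, while the variance of the random-division lengths should be noticeably larger, since random cuts produce many short sentences and a few very long ones. The second calculation, not the average, is what tells the stories apart.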
There's another story I want to explore sometime when my mind is a bit more on top of things. It's based on mortality statistics as explained in the blog post "Your body wasn't built to last: a lesson from human mortality rates." This blog post is also a good example of how statistics determine which stories are true. We might want to believe that life expectancy is based on us avoiding random accidents and staying healthy, but ultimately, that can't explain the aggregate statistics we see.
I want to know how long a sentence can expect to live in terms of words. What is the chance that a sentence will live beyond its first word? Second word? There might be occasional strikes of lightning that end a sentence, but most sentences might follow this story. Mortality rates indicate that the probability that we will die in a given year doubles every eight years. The equivalent for sentences would be that for every certain number of words in length, the probability that the sentence will end in the next word doubles. What the initial probability is and how often it doubles is unknown, but can be calculated if the statistics associated with this story hold.
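The doubling story can be sketched in a few lines of Python. The starting probability (0.01) and the doubling interval (8 words) below are placeholders picked only to show the mechanics; as noted, the real values are unknown. The function name and its parameters are likewise hypothetical.

```python
def survival_curve(h0, doubling_words, max_len):
    """Return a list where entry l is the chance a sentence is still going after l words.

    h0 is the chance that a sentence ends at its first word; the chance of
    ending at each later word doubles every `doubling_words` words.
    """
    survival = [1.0]
    for word in range(1, max_len + 1):
        # Chance of ending at this word, given the sentence got this far.
        hazard = min(1.0, h0 * 2 ** ((word - 1) / doubling_words))
        survival.append(survival[-1] * (1.0 - hazard))
    return survival

curve = survival_curve(h0=0.01, doubling_words=8, max_len=40)
for length in (1, 2, 10, 20, 40):
    print(length, round(curve[length], 4))
```

Here curve[1] answers "what is the chance that a sentence will live beyond its first word?" and curve[2] does the same for the second word.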
The equation for the probability of a sentence surviving to a length l would be P(l) = exp( -a exp( (l-b)/c ) ). Here, a, b, and c are the parameters we adjust to try to fit the curve to the distribution we find. If we can find a, b, and c such that the curve does match the distribution of sentence "lifespans," then we can consider as a possibility that as a sentence gets longer, authors try harder to end it, with author efforts doubling every certain number of words. If we can't find three numbers to make the curve match, then this cannot be an explanation for why sentence lengths are distributed as they are.
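One way to attempt the fit, offered as a sketch and not necessarily how it would be done in this series: read P(l) as the chance that a sentence runs past l words (an assumption), take log(-log P(l)), and fit a straight line in l, since log(-log P(l)) = ln(a) - b/c + l/c. Note that a and b only show up together in the intercept, so the sketch fixes b = 0 and reports just a and c. The function names are hypothetical.

```python
import math

def empirical_survival(lengths):
    """Map each length l to the fraction of sentences longer than l words."""
    n = len(lengths)
    return {l: sum(1 for x in lengths if x > l) / n for l in range(1, max(lengths))}

def fit_double_exponential(survival):
    """Least-squares line through log(-log S(l)) versus l.

    Returns (a, c) for S(l) = exp(-a * exp(l / c)), with b fixed at 0.
    Points with S equal to 0 or 1 are skipped because the logs are undefined.
    """
    points = [(l, math.log(-math.log(s))) for l, s in survival.items() if 0.0 < s < 1.0]
    n = len(points)
    mean_l = sum(l for l, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((l - mean_l) * (y - mean_y) for l, y in points)
             / sum((l - mean_l) ** 2 for l, _ in points))
    intercept = mean_y - slope * mean_l
    return math.exp(intercept), 1.0 / slope

# Check on a curve built from known, made-up parameters (a = 0.01, c = 5).
exact = {l: math.exp(-0.01 * math.exp(l / 5)) for l in range(1, 31)}
a_fit, c_fit = fit_double_exponential(exact)
print(a_fit, c_fit)

# With real data, pass the measured sentence lengths instead:
observed = empirical_survival([2, 3, 2, 8])
```

Under this form the rate at which sentences end doubles every c * ln(2) words, so c is the number to watch. If the points of log(-log S(l)) versus l do not fall near a straight line, that is a sign this story does not match the data.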
I'm throwing different statistics at the problem and seeing which ones stick to the data. Knowing which ones stick helps decide between different stories explaining how narratives are constructed. It doesn't matter if those stories include a human soul, free will, or other supernatural or metaphysical explanation. All that matters is whether the story has statistics matching what we see in the data.
Statistics are a tool for weeding through many different competing stories trying to explain something. They don't establish the true explanation by matching the data; they rule out what can't be true by failing to match.
[1] I had an English professor once declare that no probe could ever find life on Mars because God hadn't put life on Mars. No evidence was provided for this assertion other than the Bible.