Abstract: We explore string comparison, graph theory, and dimensional analysis and their implications in computational textual analysis. In the process, we develop some expectations that can be tested on a large text such as Beowulf, though we only lay out those expectations and do not test them due to the computational requirements for doing so. We draw from Old English vocabulary for our examples.
Mathematics and physics have a long, rich history of using visualization to provide the metaphors that lead to solutions. One area that has been developing quite rapidly in the last few decades is information and chaos theory, two seemingly opposite disciplines that deeply depend on each other. In such classic papers as Shaw's dripping faucet,^{ 1 }we see how information theory leads to a fundamental understanding of the studied system even though we only have access to a single stream of information. In the case of the faucet, Shaw measured the time between drips and was able to reconstruct the entire behavior of the system from those measurements. Using the reconstruction, it is possible to make predictions into the near future, as defined rigorously by the Lyapunov Exponent.^{ 2}
Text is by its nature a linear source of information, just as the measurements Shaw made or the daily price of a stock are a linear source of information. But whereas it is easy for us to characterize as numbers time measurements and prices, text presents some difficulty. We cannot naïvely assign text strings to points on a graph. We must develop some rigorous methods to do so and then see what those methods yield. All of the methods that were used in Shaw's work and in the general corpus of that field are not dependent on a specific origin. There is no center to the universe. In working out the methods that we can use in textual analysis, we will find that we do not need a center either.
Before we can place points on a graph, we must determine what those points are. We naturally think that words should be the points, and in some sense, we would be correct. In our initial development, we will assume that words are sufficient as we usually accept them. In more advanced analysis, we may need to work with symbol strings that have no inherent meaning to us instead of words that we have already associated with meaning. By removing the assumptions that we bring by using preconceived notions of words, we will make the techniques apply more generally and possibly to more languages.
The autocorrelation function is one of the techniques used to determine how to build the graph from a single stream of information. This metafunction shows how a function correlates with itself at arbitrary offsets:
So, for example, the autocorrelation of the sine wave, f(x) = sinx, is the cosine wave, f(x) = cosx (assuming appropriate limits of integration). The time t of the first zero in this function is a good indicator of how long it takes for two points in time to not be correlated such that using that time to construct a graph of <f(x), f(x+t), f(x+2t), ...,f(x+dt)> will provide the best d-dimensional result.
While we can't integrate text as we can mathematical functions, we can do something analogous to the autocorrelation function: find all of the longest common substrings. These are all of the sequences in the text that appear more than once and found by comparing the text with itself. Just as we can draw the autocorrelation of a function by moving a copy of the function over itself and comparing the two, we can find the common substrings in a text by moving one copy of the text as compared to itself and looking at how well they match up. Given an appropriate lower bound on the string size, we can use the resulting strings as our set of words without having to bring into play our knowledge of the language.
Once we know what our points will look like, we must be able to determine how to place those points in a graph. To do this, we must know the distance between them. One measure of distance that we can use is the Levenshtein edit distance: the number of insertions and deletions required to change one text into a second text. For example, to change glades to glæd, we need to replace a with æ (1 deletion and 1 insertion) and insert es on the end for a total of 1 deletion and 3 insertions. This gives us an edit distance of 4. See Table 1 for the complete pair-wise distances for glæd.
glædne | glædre | glædra | glade | gladu | glades | gladum | |
glæd | 2 | 2 | 2 | 3 | 3 | 4 | 4 |
glædne | 0 | 2 | 4 | 3 | 5 | 4 | 6 |
glædre | 0 | 2 | 3 | 5 | 4 | 6 | |
glædra | 0 | 5 | 5 | 6 | 6 | ||
glade | 0 | 2 | 1 | 3 | |||
gladu | 0 | 3 | 1 | ||||
glades | 0 | 4 |
For a distance measure to be proper, it must satisfy certain constraints (let <u, v> be the edits to transform u intov):
|<u, v>| + |<v, w>| ? |<u, w>|
|<u, v>| = |<v, u>|
The first is trivial. If two words differ, then there is at least one edit. Otherwise, the two words are the same and there are no edits. Likewise with the second equation. The number of edits that takes us from the first word to the third word can't be more than the number of edits that would take us from the first word to the second, and then from the second to the third. The third is satisfied by swapping insertions and deletions.
To construct the visualization of the word complexes, we construct a graph with the words at the vertices and the edit distances between two words as the weight of the edge between those words, e.g., Figure 1. We then find the minimal spanning tree which gives us the connections between the words that give us the minimum total edit distances. The resulting graph will have connections between the most similar words. We expect clusters to develop that share the same root, and the roots to be the connection between clusters.
To see this, let us construct the graph for the words associated with se, þes, and he. We should find three primary clusters though they are not well marked in Figure 2: the first is to the left of the line through þissa and þisra, the third is to the right of the line throughþissum and his, while the second is between the first and the third.
Seeing how edit distance operates and can lead to clustering when working with several different word groups simultaneously, we can carry it over to an analysis of Beowulf. We won't actually do the computation because it would take too long for this paper, but we will present the theory. Beowulf is traditionally presented as a collection of lines composed of words. But in the earliest texts, and as we trace text back through time, we see text as a record of what was said rather than an artifact composed of individual, independent units.^{ 3 }Instead of considering Beowulf to be a collection of lines, we can consider it to be one continuous string. Because early English was oral, we can consider it a string without any breaks between words or sentences. The oral retelling of the story would certainly have had pauses and stresses, but these would have been placed by the teller based on their interpretation of the story and might not be as inherent in the text as they are in a modern work.
Looking for common substrings as our way of finding words will not find all of the strings that we normally consider to be words. Combining the notion of edit distance with that of the text of Beowulf being a single string, we can consider sets of strings to be composing a word in its various forms. This might be a simple word such as h...ra(given two strings: h and ra), which would give us hira, hiera, hiora, and heora in Figure 2. By choosing the right minimum size of the strings and the maximum distance between words, we can build a graph of all the words in Beowulf and should be able to show the clusters of words that are similar without requiring prior knowledge of the text. While we could discern some formation of clusters in our simple examples, doing this with Beowulf would provide us with a much larger picture that we could use to find any possible relationships between the resulting distribution and any inherent relationships that we might know about already.
Leaving edit distance behind, let us consider another measure used in information theory and in Shaw's paper on the dripping faucet: the correlation dimension. Physical systems are described by sets of equations that operate simultaneously. The system must satisfy all of its equations at all times. There are a limited number of variables in such a system, the same number as there are equations governing the system. For example, a pendulum has two: position and velocity.^{ 4 }Knowing the value of those two variables tells us everything we need to know about the state of the pendulum. If we drew a graph of the position and velocity of the pendulum, we would find a circle. This circle is known as an attractor because it is the shape that describes an equilibrium motion of the pendulum. Any perturbations of the pendulum may cause it to leave the circle, but it will eventually return. It is attracted to that circle described by plotting its position and velocity.
What Shaw was able to do was take his single source of information, the time between drips, and construct the attractor for the dripping faucet. The attractor is created by plotting the values of all of the independent variables that describe the state of the system. The attractor has as many dimensions as there are independent variables. By calculating the number of dimensions the attractor had, Shaw was able to determine the number of independent variables that described the dripping faucet: three. His computations did not tell him what those three were, but they did place a limit on them.
The correlation dimension is calculated by counting how many points are near each point within successively bigger spheres and higher dimensions. The result shows how well the attractor scales and how well points correlate with each other. If we calculate this dimension for a filled in square, we will find a dimension of 2. If we calculate it for a filled in cube, we will find a dimension of 3. The results fit our expectations. We can apply this idea of correlation dimension to text by seeing how well strings correlate to each other over various distances within the text. Here, we are not using edit distance except in its limited form of how many insertions separate the two strings we are considering, analogous to the distance between points in the correlation dimension calculation, though it would be interesting to actually try and calculate a correlation dimension analogue on the graph we developed previously based on the edit distances: find the dimensionality of the words that make up the language.^{ 5}
The ultimate goal of these techniques is data mining. We are trying to visually tease out some information from the Beowulf text without applying what we might know about the inherent meaning of the text in the hope that we can find some techniques that can be applied to other languages, especially those which are not well understood. The next step will be to develop the complete code required to do the analysis and then try to understand what it produces. We will be looking for correlations and patterns and then trying to match these patterns with known features of the language.
Endnotes
1 Shaw, Robert. Dripping Faucet As a Model Chaotic System. Science Frontier Express. N.p.: Aerial P, 1984.
2 Lyapunov Exponents mark at which point an error introduced at the present will overwhelm the information in the future. E.g., Smith, James. ``Modeling Financial Data as an Ergodic System.'' Presented at the 1998 Spring Meeting of the Texas Section of the American Physical Society. Unpublished. A model is created of Exxon stock and is used to show better than random predictions of the next day's closing stock price.
3 Consider French. While written French does distinguish between words, spoken French works with syllables which can cross word boundaries. The French sentence is the basic unit which contains stress instead of the English word.
4 The equations are p = sint and v = cost. Any two of p, v, and t are necessary and sufficient to find the third. We chose p and v because they are bounded.
5 Which leads to questions of similar dimensions in natural languages versus artificial languages, similar to the fractal nature of `natural' lines that are larger than lines but smaller than planes such as coast lines, rivers, mountains, and leaves versus dams, ditches, roads, and international boundaries.