## Hacking Information

While eating breakfast this morning, I decided to finish watching “Will We Survive First Contact?,” an episode of *Morgan Freeman’s Through the Wormhole*, a nice series on Science that does a reasonable job of translating science into laymen’s terms without simplifying too much. This episode dealt with how we might know when we encountered alien communication. The topic of aliens was just a vehicle for talking about information theory. Topic modeling made its appearance, though no one called it that.

One of the segments talked about efforts to understand dolphins. The problem with all the languages we already know is that they all come from the human mind. Trying to understand a language developed by a non-human mind helps us know what problems might crop up when trying to understand a language not from Earth.

The first thing we need to do is figure out if we can tell if a signal is transmitting information or not. If it’s not, then it can’t be a language. The show illustrates this using the difference between word frequencies in a random text and word frequencies in a page from a novel. When graphing the results, words are ordered from most frequent to least frequent. Any such graph will either be a flat line (all words equally likely) or sloping from upper left to lower right (by virtue of the ordering). The show claims that if the line slopes, then the signal is carrying information and thus represents a language.

The problem is that I can generate a set of words that has the characteristic they are looking for and yet doesn’t convey any information recognizable as language. Just because the signal has the right shape doesn’t mean it’s the language that we’re looking for. What I can buy for now is that if the signal doesn’t have the right shape, then it can’t be a language we could understand.

The other problem is that I can create a signal that has a flat-line graph, yet conveys language. The way to do this is a bit technical and complicated, but it is doable. It’s the easiest way to hide your traffic from the NSA, and it’s been known for several decades.

If you are concerned about security, then you know that you want to use secure channels when communicating with other computers. This means using HTTPS when on the web, or SSH when using a terminal connection. Both of these hide the content from prying eyes, but they don’t hide that you are communicating. Both methods are prone to attacks if the attacker can accumulate enough traffic. The NSA may already be working on this.

One way is to hide the fact that you are using the channel at all. This doesn’t mean that you don’t communicate. This means that you always communicate. Always transmit packets at regular intervals, all the same size and all with the same statistics. Hide the real information in this stream. Someone watching won’t know when you are sending real information or just gibberish. If you are going through a portal to fetch web pages then you’ll want to introduce random delays along the way. Packets in the stream and resulting actions should not correlate.

Information security is hard. It’s not hard because encryption is hard or setting up SSH streams is hard. Those could be put into the Linux kernel and made universal without too much effort on the part of the user. It’s hard because extracting information from noise is so easy.

Just because we can’t crack what looks like noise doesn’t mean we can’t extract useful information from it. We might not understand dolphin, but we know that some clicks and whistles are more common than others.

One of the common methods for setting up HTTPS is to encrypt the initial exchange using RSA public key (PK) encryption. This involves multiplying and dividing by big numbers. I won’t go into all the math right now, but the security hinges on the difficulty in factoring large numbers. The numbers in this case are the product of two large primes. This product is made public in RSA PK.

A naive approach to cracking RSA would be to go through all the possible factors and see if they multiply out to the expected number. Let’s assume that our big number is 2048 bits long. If a computer can do a billion multiplications per second, it would take more than 10^{607} seconds or about 10^{600} years. Have the computer do a trillion multiplications per second and you decrease the time to 10^{597} years. Have a thousand of these computers working at once and you have 10^{594} years.

A less naive approach is to recognize that if we know the target number and assume one of the primes, we can find the other by dividing and then check by doing another multiplication. This reduces the time needed for a thousand computers each doing a trillion operations per second to about 10^{294} years.

If we look at what a 1024-bit prime looks like, we’ll notice that we already know what two of the bits must be. The prime isn’t the number 2, so it must be odd. The lowest-order bit must be a 1. If the highest-order bit wasn’t a 1, then it wouldn’t be a 1024-bit prime. It might be a 1023-bit prime. So we know that the highest-order bit is a 1. This holds for both primes that go into the public key.

We’ve already taken that into account in the above calculations about how long it would take to crack the public key if we know the high and low order bits of the primes. For each additional bit of information we know about the primes, we cut the time in half.

Let’s assume for a moment that we are multiplying two numbers. Let’s say 11101 and 10011 (29 and 19). What does this look like in long-hand multiplication?

1 | 1 | 1 | 0 | 1 | |||||

1 | 0 | 0 | 1 | 1 | |||||

1 | 1 | 1 | 0 | 1 | |||||

1 | 1 | 1 | 0 | 1 | |||||

0 | 0 | 0 | 0 | 0 | |||||

0 | 0 | 0 | 0 | 0 | |||||

1 | 1 | 1 | 0 | 1 | |||||

1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |

The result, 1000100111, is binary for 551 (29 times 19). Note that there are 10 bits in the result (5+5). The highest order bit is 1, as is the lowest (the product of two odd numbers will be odd). Note that the next to lowest bit is a 1. This is because the next to lowest bits are different (0 and 1). To see this, let’s substitute *x* and *y* for those bits and see how it falls out.

1 | 1 | 1 | x |
1 | |||||

1 | 0 | 0 | y |
1 | |||||

1 | 1 | 1 | x |
1 | |||||

y |
y |
y |
xy |
y |
|||||

0 | 0 | 0 | 0 | 0 | |||||

0 | 0 | 0 | 0 | 0 | |||||

1 | 1 | 1 | x |
1 | |||||

… | … | … | … | … |
… |
… | … | x^y |
1 |

Now we know that if the next to lowest bit in the product is a 1, that the next to lowest bits in the primes are different. If the lowest bit in the product is a 0, then the next to lowest bits in the primes are the same. If we know this bit for one prime, then we know it for the other. We’ve reduced the number of unknown bits by one. Depending on how we attack the problem, we may have cut our time in half to crack the product.

We can continue to decompose the product and generalize it, but the relationships quickly mount and become intractable, especially with traditional computing. We end up with a lot of conditions that the primes must satisfy, but no good way to try them all. This approach is not impossible if we have quantum computers. We just throw all the possibilities at the conditions and see which ones survive. In fact, this kind of entanglement (e.g., the next-to-lowest-order bits being the same or opposite) is what quantum computers should be good at.

Until we have quantum computers, we can turn to other ways of doing massively parallel computers: specialized hardware. Instead of using a general purpose computer, we can use specially designed circuits that solve the problem using basic laws of physics. These computers can only work on one candidate at a time, but they can be made small and mass produced. Even if it takes a million transistors to see if a pair is valid for the key, then we can have a few hundred of these per square centimeter of silicon. Assuming a 1 GHz clock (slow these days), then we’ll get around a trillion calculations per second per square inch of silicon. Depending on power and cooling requirements, we can get a quadrillion calculations per second per cubic foot.

This might not seem like much of a gain, but keep in mind that what we’re solving with this special hardware is really a linear equation. We have 2*k*+5 knowns (the product [2*k* knowns], the fact that the primes are odd [2 knowns], the size of the primes [2 knowns], and the parity of the next-to-lowest bits in the primes [1 known]) and 2*k*-5 unknowns (the two primes [*k* unknowns each] minus the high and low order bits [4 unknowns], and minus the parity of the next-to-lowest bits in the primes [1 known]). There are 10 more knowns than unknowns. The linear system is overdetermined, so we should be able to work our way backwards bit by bit, assuming the right circuits.

Instead of an exponential search (the number of primes to check is O(2* ^{k}*)), we might have a linear search of O(

*k*) since we might be able to determine one bit at a time. If that’s the case, then a trillion operations per second would crack a million keys a second.

This might sound amazing. You might wonder why people aren’t doing this if it could work (and it very well might not). One answer is that the hardware has to be built for the size of the key. If the hardware is built for keys of size 2*k*, then it probably won’t work for keys of size 2*k*+2. Of course, if everyone is using keys of size 2*k* and that’s the stuff you’re interested in, then it might be worth pouring money into developing specialized hardware.

This is the trade-off between specialized solutions and general solutions. We can build brilliant tools that target a specific problem. These tools will break when the problem we tackle is a little different. Or we can build generalized tools that don’t do any one thing as well as the specialized tool, but they do many different things well enough. Thus the wide availability of CPUs in general things like laptops and tablets and the embedded use of FPGAs in specialized devices such as microwaves.

The point of this isn’t that we should abandon specialized tools (though I do think we put too much emphasis on them), but that we shouldn’t discount information that doesn’t quite match what we expect. If we ignored the parity of the next-to-lowest bits because it didn’t give us the real values of those bits, we would not see the larger pattern at play. We’d be stuck with a naive brute-force method of breaking the keys.

It’s easy to say that knowing how clicks and whistles are distributed is useless because it doesn’t tell us what they mean, but knowing that there might be meaning is the first step towards discovering it. Even if a tool such as topic modeling doesn’t give us the types of results we might get from a close reading, we shouldn’t dismiss the information it can give us. Knowing a little more about the thing we are studying can help us decide how to approach it next. It might feel like we’re going down a rabbit hole in the process, but rabbit holes can lead to amazing places.

For further reading, I recommend: *The Codebreakers: The Comprehensive History of Secret Communication from Ancient Times to the Internet* and *Cryptonomicon*.

Tags: Information theory, Math, Public-key cryptography, RSA, Topic modeling