25 September 2006

Solving simple substitution ciphers

An unintended tongue-twister, for sure!

A substitution cipher is perhaps the simplest form of cryptography: one letter is substituted for another. You may see these in newspapers or quiz books, sometimes called cryptograms.

I encrypted one of my recent posts and for the sake of space will only reprint the first paragraph here:
ZHTT XFSFH IZHTHG LZ FUH XHOITG USKV DWFU S NWA GHVHTXWMH ZKSE, SX GST OITTIC ZWOBHG IVV S FCIE XYWFU ZSXX FUSF DSX FWZZHG NE HG RIUTXIT. UIDHMHC, FUH TWFFSTE KWITX DHCH LTSNKH FI OSZWFSKWPH IT FUH FLCTIMHC STG BHMWT BHKKE YWXXHG S 42 ESCG VWHKG AISK.
(For the sake of this example, we can ignore numbers and punctuation as they are not encrypted.)

One of the strongest attacks against a substitution cipher is known as frequency analysis. This concept relies on the fact that some letters are used more often than others. Our assumptions here are two-fold: first, that the message is in English (otherwise our frequency rules wouldn't make any sense) and second, that we have enough cipher text to apply to the rules (how much is enough is a question for later). Theoretically, the more cipher text we have the closer it should approach the frequency rules. The post that I encrypted has 2,145 characters so let's see how it fares against our frequency analysis attack.

Using Microsoft Word, I used the 'Find and Replace' feature for each letter of the alphabet and replaced it with itself: the result is that Word tells me how many times the letter replaced itself, and thus, how many times it found that letter in the post. Here are the most frequent letters as they appear:

H: 255 times
F: 239 times
S: 181 times
T: 159 times
I: 156 times
...
R: 2 times

Frequency analysis tells us how often each letter of the English language appears:

E: 12.7%
T: 9.1%
A: 8.2%
O: 7.5%
I: 7.0%
...
Z: 0.1%

Let's begin by examining the cipher text and replacing the three most appearing letters with the respective frequency counterparts: H=e, F=t, and S=a (this also is slightly complicated by the fact that we also need to take the still cipher text E, T, and A and replace them with H, F and S to avoid further confusion):
ZeFF Xtate IZeFeG LZ tUe XeOIFG UaKV DWtU a NWS GeVeFXWMe ZKaH, aX GaF OIFFIC ZWOBeG IVV a tCIH XYWtU ZaXX tUat DaX tWZZeG NH eG RIUFXIF. UIDeMeC, tUe FWttaFH KWIFX DeCe LFaNKe tI OaZWtaKWPe IF tUe tLCFIMeC aFG BeMWF BeKKH YWXXeG a 42 HaCG VWeKG SIaK.
As you can see, there are three instances were S replaced with a is a single word: meaning "a" is probably correct. We also see "tUe" several times which is probably "the" meaning our other two corrections are probably also correct. It also gives us a clue that U is "h". What else can we solve? We see "aX" which likely means X=s since we've already used our "t" (so X=s and U=h, which are conveniently the 7th and 8th most frequent letters):
ZeFF state IZeFeG LZ the seOIFG haKV DWth a NWX GeVeFsWMe ZKaU, as GaF OIFFIC ZWOBeG IVV a tCIU sYWth Zass that Das tWZZeG NU eG RIhFsIF. hIDeMeC, the FWttaFU KWIFs DeCe LFaNKe tI OaZWtaKWPe IF the tLCFIMeC aFG BeMWF BeKKU YWsseG a 42 UaCG VWeKG XIaK.
Still a bit murky but we're starting to see some words emerge that make sense: state, the, a, that.

Fast-forward. It turns out that frequency analysis on this text is not perfect, but close: the sequence ETAOINSHRDLC are the twelve most frequent letters (in order) and make up over 80% of all letters. In my post, the most frequent were ETANOISHRLDU. Together you can see how close this becomes:

ETAOINSHRDLC
ETANOISHRLDU

Here is our final solution:

penn state opened up the second half with a big defensive play, as dan connor picked off a troy smith pass that was tipped by ed johnson. however, the nittany lions were unable to capitalize on the turnover and kevin kelly missed a 42 yard field goal.

As you can see, frequency analysis is an important tool in attacking a substitution cipher. We were able to take the cipher text and begin to exploit it.

But why the shortcomings? Perhaps we didn't use enough text. Perhaps my writing style is "different" enough that I use some letters in different frequencies. Also, the content is important: I use the letter "n" more frequently because of the words Penn, Connor, Nittany, Lions, which are used multiple times (in fact, "n" is the only letter in the 12-letter sequence that is off by more than one position). Also, the use of proper names is likely to throw off someone trying to solve a cryptogram. So perhaps a better source text would provide slightly better results.

The learning point of this post is that given such a problem, you have a good place to start. We can begin with frequency analysis and begin to uncover the meaning the cipher text. Further exploitation occurs by applying common sense when appropriate: the most common three letter words are the, and, but, for, and are (two of which I even used in this sentence). The most common two letter words are it, is, of, in, and to.

Have further questions about this post or about cryptography in general? Feel free to post a comment.
Post a Comment