2.6 COUNTING k-GRAMS

The plan is simple – start with a large sample of plaintext and count

  • The number of times, N(i), the 1-gram i occurs in the text, and
  • The number of times, N(i, j), the 2-gram (i, j) occurs in the text,

and use the sample to construct the parameters of a Markov source. This process has been used by several authors.

  • Kullback's early monograph [Kullback, 1938] on statistical methods in cryptanalysis includes tables of k-gram counts derived from government plaintext telegrams.
  • Appendix A in Seberry and Pierprzyck's [1989] book includes frequency tables of 1-gram and 2-grams in several languages.

It is easy to derive Markov source parameters from a text downloaded from The Project Gutenberg Free eBook Library on the Web site www.gutenberg.com. The text of over 16,000 famous books, including William Shakespeare, H. G. Wells, and Jack London is available for downloading. There are two methods to determine frequencies from downloaded texts: Sliding window counts and jumping window counts.

2.6.1 Sliding Window Counts

Initialization: N(i) = N(i, j) = N(i, j, k) = 0 for 0 ≤ i, j, k < m;

for t := 0 to n − 1 do

N(xt) = N(xt) + 1;

for t := 0 to n − 2 do

N(xt, xt+1) = N(xt, Xt+1) + 1;

for t := 0 to n − 3 do

N(xt, xt+1, xt+2) = N(xt, xt+1, xt+2) + 1;

The resulting sliding window counts satisfy

image

2.6.2 Jumping Window Counts

Initialization: N(i) = N(i, j) = 0 for 0 ≤ i

Get Computer Security and Cryptography now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.