Zipf’s law describes a relationship between the frequencies and ranks of words in natural languages; see http://en.wikipedia.org/wiki/Zipf%27s_law. The “frequency” of a word is the number of times it appears in a body of work. The “rank” of a word is its position in a list of words sorted by frequency. The most common word has rank 1, the second most common has rank 2, etc.
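To make the definitions of frequency and rank concrete, here is a minimal sketch on a tiny made-up sentence. The function name `rank_words` and the sample text are illustrative assumptions, not part of the original text, and splitting on whitespace is a deliberately crude way to find words.

```python
from collections import Counter

def rank_words(text):
    """Return (rank, word, frequency) tuples, most frequent word first.

    Hypothetical helper for illustration; words are split on whitespace only.
    """
    counts = Counter(text.lower().split())
    # most_common() orders by descending count; rank is the 1-based position
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

sample = "the cat and the dog and the bird"
for rank, word, freq in rank_words(sample):
    print(rank, word, freq)
```

Here "the" appears three times, so it has frequency 3 and rank 1; "and" has frequency 2 and rank 2.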
Specifically, Zipf’s law predicts that the frequency, f, of the word with rank r is:
$$f = c r^{-s}$$
where s and c are parameters that depend on the language and the text.
If you take the logarithm of both sides of this equation, you get:
$$\log f = \log c - s \log r$$
So if you plot $\log f$ versus $\log r$, you should get a straight line with slope $-s$ and intercept $\log c$.
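The straight-line claim can be checked numerically. The sketch below generates frequencies that follow the power law exactly, using made-up parameter values (c = 1000, s = 1.2), and then fits a least-squares line to the logarithms. The helper `fit_line` is a hypothetical name introduced here; real text would only follow the line approximately.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Assumed parameter values, chosen only for illustration
c, s = 1000.0, 1.2

ranks = range(1, 51)
freqs = [c * r ** (-s) for r in ranks]   # f = c * r^(-s)

log_r = [math.log(r) for r in ranks]
log_f = [math.log(f) for f in freqs]

slope, intercept = fit_line(log_r, log_f)
print("slope:", slope)                   # should be close to -s
print("intercept:", intercept)           # should be close to log(c)
print("log(c):", math.log(c))
```

The fitted slope recovers -s and the intercept recovers log c, so with real word counts the same fit gives rough estimates of both parameters.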
Write a program that reads a text from a file, counts word frequencies, and prints one line for each word in descending order of frequency. You can test it by downloading ...