Chapter 14. Natural Language Corpus Data

Peter Norvig

Most of this book deals with data that is beautiful in the sense of Baudelaire: "All which is beautiful and noble is the result of reason and calculation." This chapter's data is beautiful in Thoreau's sense: "All men are really most attracted by the beauty of plain speech." The data we will examine is the plainest of speech: a trillion words of English, taken from publicly available web pages. It holds all the banality of the Web (the spelling and grammatical errors, the LOL cats, the Rickrolling), but also the collected works of Twain, Dickens, Austen, and millions of other authors.

The trillion-word data set was published by Thorsten Brants and Alex Franz of Google in 2006 and is available through the Linguistic Data Consortium (http://tinyurl.com/ngrams). The data set summarizes the original texts by counting the number of appearances of each word, and of each two-, three-, four-, and five-word sequence. For example, "the" appears 23 billion times (2.2% of the trillion words), making it the most common word. The word "rebating" appears 12,750 times (a millionth of a percent), as does "fnuny" (apparently a misspelling of "funny"). In three-word sequences, "Find all posts" appears 13 million times (0.001%), about as often as "each of the," but well below the 100 million of "All Rights Reserved" (0.01%). Here's an excerpt from the three-word sequences:

    outraged many African      63
    outraged many Americans   203
    outraged many Christians   56
    outraged ...
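The kind of summary described above, a count for each word and for each sequence of up to five words, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used to build the data set: tokenization here is a plain whitespace split, the sample sentence is invented, and the names `ngram_counts` and `percent_of_corpus` are hypothetical helpers introduced only for this example. The last line repeats the arithmetic behind the "Find all posts" figure, treating the corpus as a round trillion words.

```python
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every sequence of 1 to max_n consecutive tokens (hypothetical helper)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def percent_of_corpus(count, total):
    """Express a raw count as a percentage of the corpus size (hypothetical helper)."""
    return 100.0 * count / total


# Toy input; whitespace splitting is an assumption, not the data set's tokenizer.
tokens = "the cat sat on the mat and the cat slept".split()
counts = ngram_counts(tokens)

print(counts[("the",)])        # occurrences of the single word "the"
print(counts[("the", "cat")])  # occurrences of the two-word sequence "the cat"

# 13 million occurrences out of roughly a trillion words, as a percentage.
print(percent_of_corpus(13_000_000, 10**12))
```

Keying the counter by tuples keeps sequences of every length in one table, which mirrors the way the excerpt pairs each sequence with a single count.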
