Case Study: Google NGrams
The Google Books NGrams dataset is a rich trove of data containing the count of every word that occurs in the millions of books scanned by Google Books since 2005. The dataset is publicly available,[17] quite large, and relatively simple to work with. All the files together consist of hundreds of gigabytes of data, and just the counts of individual words take up more than 50GB. Just so we don’t get totally overwhelmed, we’ll cut it down and only take the 2GB file of words that start with the letter A. In the Docker build environment, you can get it by running the /get_ngram_data.sh script, which will download a file called googlebooks-eng-all-1gram-20120701-a.
(If you’re working on your own machine, you can download ...
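To get a feel for why the data is simple to work with, here is a minimal sketch of reading it in plain Scala. It assumes the layout publicly documented for the 2012 1-gram files, where each line holds a word, a year, a match count, and a volume count separated by tabs. The names NGramRecord, NGramLine, parse, and totals are hypothetical and chosen only for illustration.

```scala
// Hypothetical record type; the four fields mirror the assumed line layout:
// word<TAB>year<TAB>match_count<TAB>volume_count
final case class NGramRecord(word: String, year: Int, matchCount: Long, volumeCount: Long)

object NGramLine {
  // Parse one tab-separated line; return None if it does not have
  // exactly four fields or if a numeric field fails to parse.
  def parse(line: String): Option[NGramRecord] =
    line.split('\t') match {
      case Array(word, year, matches, volumes) =>
        try Some(NGramRecord(word, year.trim.toInt, matches.trim.toLong, volumes.trim.toLong))
        catch { case _: NumberFormatException => None }
      case _ => None
    }

  // Sum the match counts for each word across all years,
  // silently skipping lines that fail to parse.
  def totals(lines: Iterator[String]): Map[String, Long] =
    lines.flatMap(l => parse(l)).foldLeft(Map.empty[String, Long]) { (acc, r) =>
      acc.updated(r.word, acc.getOrElse(r.word, 0L) + r.matchCount)
    }
}
```

With the file on disk, something like `NGramLine.totals(scala.io.Source.fromFile(path).getLines())` would produce the per-word totals, where `path` points at the downloaded file. This version favors readability over speed and keeps every distinct word in an in-memory map, so treat it as a starting point rather than a tuned implementation.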