January 2019
To train a good language model, we need a lot of data. The English translation of Leo Tolstoy's "War and Peace" contains more than 500,000 words, which makes it a good candidate for our small example. The book is in the public domain and can be downloaded for free as plain text from Project Gutenberg (http://www.gutenberg.org/). As part of preprocessing, we'll remove the Gutenberg license, the book information, and the table of contents. Next, we'll strip out newlines in the middle of sentences and reduce the maximum number of consecutive newlines to two (the code can be found at https://github.com/ivan-vasilev/Python-Deep-Learning-SE/blob/master/ch07/language%20model/data_processing.py).
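The cleaning steps above can be sketched as follows. This is a minimal illustration, not the repository's actual data_processing.py: it assumes the standard "*** START OF ..." / "*** END OF ..." markers that Project Gutenberg files typically use, and the function name is our own.

```python
import re


def preprocess_gutenberg_text(text: str) -> str:
    """Clean a Project Gutenberg plain-text file for language modeling.

    Assumes the usual '*** START OF ...' / '*** END OF ...' markers;
    some files may use slightly different delimiters.
    """
    # Drop the Gutenberg header (license, book information) and footer
    start = text.find("*** START OF")
    end = text.find("*** END OF")
    if start != -1 and end != -1:
        # Keep only the text between the line after the START marker
        # and the END marker
        text = text[text.index("\n", start) + 1:end]
    # Join lines that were broken in the middle of a sentence:
    # replace a single newline (not part of a blank-line run) with a space
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Reduce runs of three or more newlines to at most two
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text
```

Removing the table of contents is harder to do generically, since its layout varies from book to book; in practice it can be cut by hand or matched with a book-specific pattern.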
To feed the data ...