Getting the data

The data we will use for the first part of this chapter is a set of books from Project Gutenberg (www.gutenberg.org), a repository of public domain literary works. The books I used for these experiments come from a variety of authors:

  • Booth Tarkington (22 titles)
  • Charles Dickens (44 titles)
  • Edith Nesbit (10 titles)
  • Arthur Conan Doyle (51 titles)
  • Mark Twain (29 titles)
  • Sir Richard Francis Burton (11 titles)
  • Emile Gaboriau (10 titles)

Overall, there are 177 documents from 7 authors, giving a significant amount of text to work with. A full list of the titles, along with download links and a script to fetch them automatically, is given in the code bundle in a file called getdata.py. If running the code results in significantly ...
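To give a feel for what such a fetch script involves, here is a minimal sketch. It is not the getdata.py from the code bundle: the URL pattern, the folder layout, and the names TITLE_COUNTS, author_folder, and fetch_book are all assumptions made for illustration. Only the per-author title counts come from the list above.

```python
import os
from urllib.request import urlopen

# Title counts per author, copied from the list in this section.
TITLE_COUNTS = {
    "Booth Tarkington": 22,
    "Charles Dickens": 44,
    "Edith Nesbit": 10,
    "Arthur Conan Doyle": 51,
    "Mark Twain": 29,
    "Sir Richard Francis Burton": 11,
    "Emile Gaboriau": 10,
}

# ASSUMPTION: a plain-text URL pattern for Project Gutenberg books.
# The real getdata.py may use different links entirely.
URL_TEMPLATE = "https://www.gutenberg.org/ebooks/{book_id}.txt.utf-8"


def author_folder(data_folder, author):
    """Return the (hypothetical) folder for one author, e.g. data/mark_twain."""
    return os.path.join(data_folder, author.lower().replace(" ", "_"))


def fetch_book(book_id, author, data_folder="data", opener=urlopen):
    """Download one book unless it is already on disk; return the file path.

    `opener` defaults to urllib's urlopen and can be swapped out for testing.
    """
    folder = author_folder(data_folder, author)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, "{}.txt".format(book_id))
    if not os.path.exists(path):
        # Read the raw bytes and store them unchanged.
        with opener(URL_TEMPLATE.format(book_id=book_id)) as response:
            content = response.read()
        with open(path, "wb") as out:
            out.write(content)
    return path


if __name__ == "__main__":
    # Sanity check on the corpus size; no network access happens here.
    print(len(TITLE_COUNTS), "authors,", sum(TITLE_COUNTS.values()), "titles")
```

Here book_id stands for the numeric identifier that appears in a book's Project Gutenberg URL. Skipping files that already exist means the script can be rerun after an interrupted download without fetching everything again, and storing each author's books in a separate folder keeps the author label attached to every document.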
