August 2014
Beginner to intermediate
304 pages
7h 10m
English
If you have a large corpus of text, you might want to categorize it into separate sections. This can be helpful for organization, or for text classification, which is covered in Chapter 7, Text Classification. The brown corpus, for example, has a number of different categories, as shown in the following code:
>>> from nltk.corpus import brown >>> brown.categories() ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
In this recipe, we'll learn how to create our own categorized text corpus.
The easiest way to categorize a corpus is to have one file for each category. The following ...