August 2014
Beginner to intermediate
304 pages
7h 10m
English
A corpus is a collection of text documents, and corpora is the plural of corpus. The word comes from the Latin for body; in this case, a body of text. So a custom corpus is really just a bunch of text files in a directory, often alongside many other directories of text files.
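To make the "text files in a directory" idea concrete, here is a minimal sketch that uses only the Python standard library, not NLTK's own corpus readers. The directory name cookbook, the file names, and the load_corpus helper are all hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path


def load_corpus(root):
    """Map each .txt file name directly under root to its text content."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(root).glob("*.txt"))
    }


# Build a throwaway corpus directory holding two hypothetical documents.
with tempfile.TemporaryDirectory() as tmp:
    corpus_dir = Path(tmp) / "cookbook"
    corpus_dir.mkdir()
    (corpus_dir / "doc1.txt").write_text("A corpus is a body of text.", encoding="utf-8")
    (corpus_dir / "doc2.txt").write_text("Corpora is the plural.", encoding="utf-8")
    corpus = load_corpus(corpus_dir)

print(sorted(corpus))
```

Each file is one document, and the directory as a whole is the corpus.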
You should already have the NLTK data package installed, following the instructions at http://www.nltk.org/data. We'll assume that the data is installed to C:\nltk_data on Windows, and /usr/share/nltk_data on Linux, Unix, and Mac OS X.
NLTK defines a list of data directories, or paths, in nltk.data.path. Our custom corpora must be within one of these paths so that NLTK can find them. In order to avoid conflict with the ...
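The search-path idea can be illustrated with a short sketch. This is not NLTK's actual lookup code. The find_corpus_dir helper is hypothetical, and the corpora subdirectory layout and the cookbook corpus name are assumptions made for the example.

```python
import os
import tempfile


def find_corpus_dir(name, search_paths):
    """Return the first <path>/corpora/<name> directory that exists, else None."""
    for base in search_paths:
        candidate = os.path.join(base, "corpora", name)
        if os.path.isdir(candidate):
            return candidate
    return None


# Simulate a data directory that contains one custom corpus.
with tempfile.TemporaryDirectory() as data_dir:
    os.makedirs(os.path.join(data_dir, "corpora", "cookbook"))
    # The first entry does not exist, so the search falls through to the second.
    search_paths = ["/nonexistent/nltk_data", data_dir]
    found = find_corpus_dir("cookbook", search_paths)
    missing = find_corpus_dir("no_such_corpus", search_paths)

print(found is not None, missing)
```

With NLTK installed, nltk.data.path is a plain Python list of directory strings, so it could be passed as search_paths to see which of your data directories, if any, holds a given corpus.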