The corpora that are part of the NLTK distribution are already tokenized, so we can easily get lists of words and sentences. For our own corpora, we need to apply tokenization ourselves. This recipe demonstrates how to implement tokenization with NLTK. The text file we will use is in this book's code bundle. This particular text is in English, but NLTK supports other languages too.
Install NLTK, following the instructions in the Introduction section of this chapter.
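If NLTK is not yet set up, a typical installation (a minimal sketch assuming pip and a standard Python environment; the steps in the Introduction section may differ) is:

$ pip install nltk

The sent_tokenize() and word_tokenize() functions also need the Punkt tokenizer models, which can be downloaded once from Python:

import nltk
nltk.download('punkt')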
The program is in the tokenizing.py file in this book's code bundle:
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
import dautil as dl
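Only the imports of tokenizing.py are reproduced above; the rest of the program lives in the code bundle. As a minimal sketch of the technique itself (using a hypothetical inline sample string instead of the book's text file, and leaving out the dautil helpers), sentence and word tokenization with NLTK looks like this:

from nltk.tokenize import sent_tokenize, word_tokenize

# Hypothetical sample text standing in for the file from the code bundle
text = "NLTK ships with many corpora. It also tokenizes raw text."

sentences = sent_tokenize(text)   # split the text into a list of sentences
words = word_tokenize(text)       # split the text into word and punctuation tokens

print(sentences)
print(words)

Both functions default to English; passing, for example, language='german' selects one of the other Punkt models bundled with NLTK, which is how the recipe's note about other languages applies in practice.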