August 2014
Beginner to intermediate
304 pages
7h 10m
English
NLTK's default sentence tokenizer is general-purpose and usually works quite well, but it is not always the best choice for your text. Perhaps your text uses nonstandard punctuation or is formatted in an unusual way. In such cases, training your own sentence tokenizer can result in much more accurate sentence tokenization.
For this example, we'll be using the webtext corpus, specifically the overheard.txt file, so make sure you've downloaded this corpus. The text in this file is formatted as dialog that looks like this:
White guy: So, do you have any plans for this evening?
Asian girl: Yeah, being angry!
White guy: Oh, that sounds good.
As you can see, this isn't your standard paragraph of sentences formatting, ...
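As a rough sketch of what training a sentence tokenizer on this kind of text could look like, the snippet below uses NLTK's `PunktSentenceTokenizer`, which learns its model from raw text when that text is passed to the constructor. This is one plausible approach, not necessarily the exact recipe that follows. The helper names `load_overheard` and `train_sentence_tokenizer` are hypothetical, and the sample string simply reuses the three dialog lines quoted above, assuming one utterance per line. Three lines are far too little to train a useful model, so they only stand in for the full file.

```python
from nltk.corpus import webtext
from nltk.tokenize import PunktSentenceTokenizer

# The three dialog lines quoted above, assumed to be one utterance per line.
# This is only a stand-in for the full overheard.txt file.
SAMPLE_DIALOG = (
    "White guy: So, do you have any plans for this evening?\n"
    "Asian girl: Yeah, being angry!\n"
    "White guy: Oh, that sounds good.\n"
)


def load_overheard():
    """Return the raw text of overheard.txt (hypothetical helper).

    Requires the webtext corpus, e.g. via nltk.download('webtext').
    """
    return webtext.raw('overheard.txt')


def train_sentence_tokenizer(text):
    """Train an unsupervised Punkt model on `text` (hypothetical helper)."""
    # Passing training text to the constructor trains the tokenizer on it.
    return PunktSentenceTokenizer(text)


# Train on the small sample so the sketch runs without any corpus download.
tokenizer = train_sentence_tokenizer(SAMPLE_DIALOG)
sentences = tokenizer.tokenize(SAMPLE_DIALOG)

for sentence in sentences:
    print(repr(sentence))
```

With the corpus downloaded, replacing `SAMPLE_DIALOG` with the result of `load_overheard()` trains on the whole file, which gives the model enough text to learn from.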