July 2017
Intermediate to advanced
360 pages
8h 26m
English
The simplest way to tokenize a sentence into words is provided by the class TreebankWordTokenizer, which, however, has some limitations:
>>> from nltk.tokenize import TreebankWordTokenizer
>>> simple_text = 'This is a simple text.'
>>> tbwt = TreebankWordTokenizer()
>>> print(tbwt.tokenize(simple_text))
['This', 'is', 'a', 'simple', 'text', '.']
>>> complex_text = 'This isn\'t a simple text'
>>> print(tbwt.tokenize(complex_text))
['This', 'is', "n't", 'a', 'simple', 'text']
As you can see, in the first case the sentence has been correctly split into words, keeping the punctuation separate (this is not a real issue because it can be removed in a second step). However, in the complex example, the contraction isn't has been split into ...
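The "second step" mentioned above for removing punctuation is not shown in this excerpt. A minimal sketch of one possible approach follows, assuming that tokens made up entirely of ASCII punctuation characters should be dropped. The helper name drop_punctuation is hypothetical and not part of NLTK; the input list is the tokenizer output shown earlier.

```python
import string


def drop_punctuation(tokens):
    """Return the tokens that are not made up entirely of punctuation.

    Hypothetical helper (not an NLTK function). A token such as "n't"
    is kept because it contains letters; a token such as '.' is dropped.
    """
    return [
        token
        for token in tokens
        if not all(ch in string.punctuation for ch in token)
    ]


# Token list copied from the TreebankWordTokenizer output above
tokens = ['This', 'is', 'a', 'simple', 'text', '.']
print(drop_punctuation(tokens))
```

Filtering after tokenization keeps the two concerns separate, so the same helper can be reused with a different tokenizer. Note that string.punctuation covers only ASCII punctuation; text with typographic quotes or other Unicode punctuation would need a broader check.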