July 2017
In many cases, it's useful to split a large text into sentences, which are normally delimited by a full stop or an equivalent mark. As every language has its own orthographic rules, NLTK offers a function called sent_tokenize() that accepts a language parameter (the default is English) and splits the text according to that language's specific rules. In the following example, we show the usage of this function with different languages:
>>> from nltk.tokenize import sent_tokenize
>>> generic_text = 'Lorem ipsum dolor sit amet, amet minim temporibus in sit. Vel ne impedit consequat intellegebat.'
>>> print(sent_tokenize(generic_text))
['Lorem ipsum dolor sit amet, amet minim temporibus in sit.', 'Vel ne impedit consequat intellegebat.']
>>> english_text ...
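The language-specific rules mentioned above matter because a naive split on full stops breaks on abbreviations such as "Dr." or "e.g.". NLTK's sent_tokenize() relies on trained Punkt models to handle such cases; the following is only a minimal standard-library sketch of the idea (the naive_sent_tokenize() function and its small abbreviation list are illustrative assumptions, not part of NLTK), showing why sentence splitting needs more than punctuation matching:

```python
import re

# Illustrative only: split on '.', '!' or '?' followed by whitespace and a
# capital letter, skipping a tiny hand-made list of abbreviations. NLTK's
# Punkt models learn such boundaries from data; this sketch is NOT equivalent.
ABBREVIATIONS = {'mr.', 'mrs.', 'dr.', 'e.g.', 'i.e.'}

def naive_sent_tokenize(text):
    sentences = []
    start = 0
    for match in re.finditer(r'[.!?]\s+(?=[A-Z])', text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a boundary
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(naive_sent_tokenize('Dr. Smith arrived. He was late.'))
# → ['Dr. Smith arrived.', 'He was late.']
```

A real splitter must also cope with ellipses, quotations, and language-specific conventions, which is exactly what the per-language models behind sent_tokenize() provide.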