How it works...
In this recipe, we split a text string into sentences using sent_tokenize from the NLTK library. sent_tokenize relies on a pre-trained model that recognizes the capitalization and punctuation patterns that signal where a sentence begins and ends.
We first applied sent_tokenize to a manually created string to become familiar with its functionality. The tokenizer divided the text into a list of seven sentences. We then passed that list to Python's built-in len() function to count the number of sentences in the string.
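The tokenize-then-count step can be sketched as follows. This is a minimal illustration, not the recipe's exact code: the sample string and the helper name split_sentences are our own, and the sketch assumes that NLTK and its sentence tokenizer data are installed. If they are not, it falls back to a naive regular-expression split, which is a stand-in and not NLTK's algorithm.

```python
import re


def split_sentences(text):
    """Split text into sentences with NLTK, or with a naive regex if NLTK is unavailable."""
    try:
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except (ImportError, LookupError):
        # Fallback (an assumption, not NLTK's method): split on whitespace
        # that follows a period, exclamation mark, or question mark.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Hypothetical sample string, not the one used in the recipe
text = "The recipe starts here. It splits text into sentences! Does it count them too?"

sentences = split_sentences(text)  # a list with one string per sentence
print(sentences)
print(len(sentences))              # number of sentences in the string
```

Because the tokenizer returns a plain Python list, len() is all we need to turn its output into a sentence count.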
Next, we loaded a text dataset and, to speed up the computation, retained only the first 10 rows of the dataframe using pandas' loc[]. We then removed the first part of the text ...
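The row selection with loc[] might look like the sketch below. The dataframe here is a toy stand-in that we build ourselves, and the column name text is hypothetical. The sketch also assumes the default integer index that pandas assigns when none is given.

```python
import pandas as pd

# Toy stand-in for the recipe's dataset; the column name "text" is hypothetical
df = pd.DataFrame(
    {"text": [f"Document number {i}. It has two sentences." for i in range(15)]}
)

# loc[] slices by label and includes the end label,
# so :9 keeps the rows labeled 0 through 9, that is, the first 10 rows
subset = df.loc[:9]
print(len(subset))
```

Note that loc[] is label-based: with a non-default index, the slice :9 would no longer be guaranteed to return the first 10 rows.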