Tokenizing sentences into words
In this recipe, we'll split a sentence into individual words. The simple task of creating a list of words from a string is an essential part of most text processing.
How to do it...
Basic word tokenization is very simple; use the word_tokenize() function:
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize('Hello World.')
['Hello', 'World', '.']
How it works...
The word_tokenize() function is a wrapper function that calls tokenize() on an instance of the TreebankWordTokenizer class. It's equivalent to the following code:
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize('Hello World.')
['Hello', 'World', '.']
It works by separating words using spaces ...
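To make the idea concrete, here is a rough sketch of splitting on spaces while keeping punctuation as separate tokens, as in the ['Hello', 'World', '.'] output above. It uses only the standard re module and is not how TreebankWordTokenizer is implemented; the function name simple_word_tokenize and the regular expression are assumptions made purely for illustration.

```python
import re


def simple_word_tokenize(text):
    """Hypothetical helper: split text into word and punctuation tokens.

    This is an illustrative approximation, not NLTK's actual rule set.
    """
    # \w+ matches a run of word characters (letters, digits, underscore).
    # [^\w\s] matches one character that is neither a word character
    # nor whitespace, so each punctuation mark becomes its own token.
    # Whitespace matches neither alternative, so it is simply skipped.
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_word_tokenize('Hello World.'))  # ['Hello', 'World', '.']
```

This sketch has no special handling for contractions or other cases a real tokenizer has to consider, so it should be read only as a picture of the basic approach.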