How can I tokenize a sentence with Python? (source: O'Reilly)

How do you tokenize a sentence?

Tokenization breaks a sentence into words and punctuation marks, and it is usually the first step in processing text. We will do tokenization in both NLTK and spaCy.

First, we will do tokenization in the Natural Language Toolkit (NLTK).

The result of tokenization is a list of tokens.

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models; only needed once

text1 = "It's true that the chicken was the best bamboozler in the known multiverse."
tokens = word_tokenize(text1)
print(tokens)

Next, we will do tokenization in spaCy, a fast Python NLP library with a well-designed API.

from spacy.lang.en import English  # in old spaCy 1.x this was spacy.en
parser = English()
print(parser)

spaCy keeps whitespace tokens, so you have to filter them out.

text1 = "I like statements that are both true and absurd."
tokens = parser(text1)
tokens = [token.orth_ for token in tokens if not token.orth_.isspace()]
print(tokens)
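To see why that filter is needed, here is a small sketch (the doubled spaces are deliberate): when a string contains runs of extra whitespace, spaCy emits the surplus whitespace as tokens of its own.

```python
from spacy.lang.en import English  # blank pipeline: tokenizer only, no model download

nlp = English()
doc = nlp("statements  that are   absurd")  # note the doubled/tripled spaces

all_tokens = [t.orth_ for t in doc]  # includes whitespace tokens like ' '
words = [t.orth_ for t in doc if not t.orth_.isspace()]
print(all_tokens)
print(words)
```

The filtered list contains only the four words, while the unfiltered list also carries the stray whitespace tokens.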

Note for Python 2: spaCy requires Unicode

In Python 2 you need to prefix string literals with u, like this: u"bob".

Or you need to convert an existing byte string such as my_string to Unicode like this:

my_string_u = my_string.decode('utf-8', errors='ignore')

As an exercise, you can try to generate edge cases like the one below.

textu = "I'm Mr. O'Malley, and I love things, i.e., tacos etc."
tokens = parser(textu)
tokens = [token.orth_ for token in tokens if not token.orth_.isspace()]
print(tokens)