How do you tokenize a sentence?
First, we will do tokenization in the Natural Language Toolkit (NLTK).
The result of tokenization is a list of tokens.
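To see what such a list looks like before any library is involved, Python's built-in str.split is the crudest possible tokenizer. This is only a baseline sketch, not a substitute for NLTK or spaCy:

```python
# A baseline "tokenizer": split on whitespace only.
text = "It's true that the chicken was the best bamboozler in the known multiverse."
tokens = text.split()
print(tokens)
# Punctuation stays glued to words (the final token is "multiverse."),
# which is one reason to use a real tokenizer instead.
```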
from nltk.tokenize import word_tokenize

# word_tokenize needs the "punkt" tokenizer models; download them once with:
# import nltk; nltk.download('punkt')

text1 = "It's true that the chicken was the best bamboozler in the known multiverse."
tokens = word_tokenize(text1)
print(tokens)
Next, we will do tokenization in spaCy (a newer Python NLP library with a full processing pipeline).
import spacy

# spacy.en.English was removed in spaCy 2.0; load a model instead
# (install it first with: python -m spacy download en_core_web_sm)
parser = spacy.load("en_core_web_sm")
print(parser)
spaCy keeps whitespace tokens, so you have to filter them out.
text1 = "I like statements that are both true and absurd."
tokens = parser(text1)
tokens = [token.orth_ for token in tokens if not token.orth_.isspace()]
print(tokens)
Note for Python 2: spaCy requires Unicode input. Either prefix your string literals with u, like u"I like tacos.", or convert an existing my_string like this:

my_string_u = my_string.decode('utf-8', errors='ignore')
As an exercise, you can try to generate edge cases like the one below.
textu = "I'm Mr. O'Malley, and I love things, i.e., tacos etc."
tokens = parser(textu)
tokens = [token.orth_ for token in tokens if not token.orth_.isspace()]
print(tokens)
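To see why sentences like this are edge cases at all, here is a sketch of a naive regex tokenizer (hypothetical, for illustration only). It mangles exactly the contractions and abbreviations that NLTK and spaCy are built to handle:

```python
import re

def naive_tokenize(text):
    # A deliberately simple tokenizer: runs of word characters are tokens,
    # and every other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

textu = "I'm Mr. O'Malley, and I love things, i.e., tacos etc."
print(naive_tokenize(textu))
# "I'm" is shredded into ["I", "'", "m"] and "Mr." into ["Mr", "."],
# losing the contraction and abbreviation structure entirely.
```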