October 2018 · Intermediate to advanced · 472 pages · 10h 57m · English
Tokenization splits a corpus into sentences, words, or other tokens. It prepares raw text for further processing and is typically the first step in an NLP pipeline. What counts as a token can vary with the task or the domain you are working in, so keep an open mind about what you treat as a token!
Let's load a corpus and use the NLTK tokenizers to first split the raw text into sentences, and then split each sentence into words:
text = u"""