November 2016
Beginner to intermediate
687 pages
15h 31m
English
A word, or token, is the smallest unit that a machine can understand and process, so no text string can be processed further until it has been tokenized. Tokenization is the process of splitting a raw string into meaningful tokens. Its complexity depends on the needs of the NLP application and on the language itself. In English, for example, it can be as simple as selecting only words and numbers with a regular expression. For Chinese and Japanese, which do not separate words with spaces, it is a much harder task.
>>> s = "Hi Everyone ! hola gr8"
>>> # simplest tokenizer
>>> print(s.split())
['Hi', 'Everyone', '!', 'hola', 'gr8']
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize(s)
['Hi', 'Everyone', '!', 'hola', ...
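The regular-expression approach mentioned above for English can be sketched with the standard library alone. This is a minimal illustration, not the book's own code: the helper name regex_tokenize and the pattern are assumptions chosen for the example, and the pattern covers ASCII letters and digits only.

```python
import re


def regex_tokenize(text):
    """Hypothetical helper: keep runs of ASCII letters and digits as tokens.

    Punctuation and whitespace are treated as separators and dropped.
    """
    return re.findall(r"[A-Za-z0-9]+", text)


s = "Hi Everyone ! hola gr8"
print(regex_tokenize(s))  # ['Hi', 'Everyone', 'hola', 'gr8']
```

Unlike s.split(), this drops the standalone "!" because it matches neither letters nor digits. The same rule also splits contractions such as "It's" into "It" and "s", which is one reason real applications often need a more careful tokenizer.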