July 2017
Intermediate to advanced
796 pages
18h 55m
English
Tokenization is the process of enchanting important components from raw text, such as words, and sentences, and breaking the raw texts into individual terms (also called words). If you want to have more advanced tokenization on regular expression matching, RegexTokenizer is a good option for doing so. By default, the parameter pattern (regex, default: s+) is used as delimiters to split the input text. Otherwise, you can also set parameter gaps to false, indicating the regex pattern denotes tokens rather than splitting gaps. This way, you can find all matching occurrences as the tokenization result.
Suppose you have the following sentences:
Read now
Unlock full access