November 2019
In step 1, we used DefaultTokenizerFactory() to create the tokenizer factory that splits the text into words. This is the default tokenizer factory for Word2Vec, and the tokenizers it creates are based on a string tokenizer or a stream tokenizer, depending on whether the input is a string or a stream. We also used CommonPreprocessor as the token preprocessor. A token preprocessor normalizes each token before it is added to the vocabulary, so that variants such as "Word," and "word" are not treated as different words. CommonPreprocessor is a token preprocessor implementation that removes punctuation marks and converts the text to lowercase. It lowercases tokens with String.toLowerCase(), so its behavior depends on the JVM's default locale.
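To make the tokenize-then-preprocess flow concrete, here is a minimal sketch that uses only the JDK. It is not DL4J's implementation: the class name TokenizerSketch, the cleanup pattern, and the choice to drop empty tokens are assumptions made for illustration. It splits a sentence on whitespace with StringTokenizer, strips punctuation, and lowercases each token with an explicit locale so that the locale dependence mentioned above is visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// Hypothetical stand-in for a tokenizer factory plus token preprocessor.
class TokenizerSketch {

    // Assumed cleanup pattern: digits and common punctuation marks.
    private static final Pattern PUNCT =
            Pattern.compile("[\\d.:,\"'()\\[\\]|/?!;]+");

    // Strip punctuation, then lowercase using the given locale.
    static String preProcess(String token, Locale locale) {
        return PUNCT.matcher(token).replaceAll("").toLowerCase(locale);
    }

    // Split on whitespace and preprocess every token.
    static List<String> tokenize(String sentence, Locale locale) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(sentence);
        while (st.hasMoreTokens()) {
            String cleaned = preProcess(st.nextToken(), locale);
            if (!cleaned.isEmpty()) { // this sketch drops tokens that were only punctuation
                tokens.add(cleaned);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, world, its, almost, done]
        System.out.println(tokenize("Hello, World! It's (almost) done.", Locale.ROOT));

        // The same token lowercases differently under a Turkish locale:
        System.out.println(preProcess("TITLE", Locale.ROOT));                 // title
        System.out.println(preProcess("TITLE", Locale.forLanguageTag("tr"))); // tıtle (dotless ı)
    }
}
```

The last two lines show why the default locale matters: on a JVM whose default locale is Turkish, a locale-sensitive toLowerCase() call maps "I" to the dotless "ı", so the same corpus can produce a different vocabulary on a different machine.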
Here are the configurations that we made in step 2: