In the first example, we'll train a word2vec model on the classic novel War and Peace by Leo Tolstoy. The novel is stored as a regular text file in the code repository. Let's start:
- As usual, we'll start with the imports:
import logging
import pprint  # beautify prints
import gensim
import nltk
- Then, we'll set the logging level to INFO so we can track the training progress:
logging.basicConfig(level=logging.INFO)
- Next, we'll implement the text tokenization pipeline. Tokenization means breaking a text sequence into pieces (tokens), which can be individual words, keywords, phrases, symbols, or even whole sentences. We'll implement two-level tokenization; ...
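To make the idea of two-level tokenization concrete, here is a minimal sketch. It assumes the two levels are sentences first and then words within each sentence. The function name `tokenize_two_level` and both regular expressions are hypothetical stand-ins chosen for illustration, not the tokenizer used in the example itself; a real pipeline would more likely rely on a library tokenizer such as the ones in nltk.

```python
import re

# Assumption: simple regexes stand in for a real sentence/word tokenizer.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")  # whitespace that follows ., ! or ?
_WORD = re.compile(r"[A-Za-z']+")             # runs of letters and apostrophes


def tokenize_two_level(text):
    """Split text into sentences, then each sentence into lowercase word tokens."""
    # Level 1: break the text into sentence strings.
    sentences = _SENTENCE_END.split(text.strip())
    # Level 2: break each sentence into word tokens.
    tokenized = [_WORD.findall(sentence.lower()) for sentence in sentences]
    # Drop sentences that contain no word tokens at all.
    return [tokens for tokens in tokenized if tokens]


sample = "Well, Prince. So Genoa and Lucca are now just family estates!"
print(tokenize_two_level(sample))
```

The result is a list of sentences, each of which is a list of word tokens. This nested shape matches the kind of input gensim's `Word2Vec` accepts for training: an iterable of token lists.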