December 2019
Intermediate to advanced
468 pages
14h 28m
English
To obtain better word vectors than the ones in the Training embedding model section, we'll train another word2vec model. This time, however, we will use a larger corpus: the text8 dataset, which consists of the first 100,000,000 bytes of plain text from Wikipedia. The dataset is included in Gensim, and it comes pre-tokenized as a single long list of words. With that, let's start:
import logging
import pprint  # beautify prints

import gensim.downloader as gensim_downloader
import matplotlib.pyplot as plt
import numpy as np
from gensim.models.word2vec import Word2Vec
from sklearn.manifold import TSNE

logging.basicConfig(level=logging.INFO) ...