To obtain better word vectors than the ones in the Training embedding model section, we'll train another word2vec model. This time, however, we will use a larger corpus: the text8 dataset, which consists of the first 100,000,000 bytes of plain text from a cleaned Wikipedia dump. The dataset is available through Gensim's downloader and comes pre-tokenized as a single long sequence of words. With that, let's start:
- As usual, the imports come first. We'll also set the logging level to INFO for good measure:
import logging
import pprint  # beautify prints

import gensim.downloader as gensim_downloader
import matplotlib.pyplot as plt
import numpy as np
from gensim.models.word2vec import Word2Vec
from sklearn.manifold import TSNE

logging.basicConfig(level=logging.INFO)
...
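The steps that follow (elided above) load the corpus and train the model. As a rough sketch of what that involves, assuming Gensim's downloader API and illustrative hyperparameter values (vector size, window, minimum count) rather than the exact settings used later in this section:

# Minimal sketch: load text8 and train a skip-gram word2vec model.
# Hyperparameter values below are assumptions for illustration only.

# Load the pre-tokenized text8 corpus via Gensim's downloader
text8_corpus = gensim_downloader.load('text8')

# Train the model; vector_size is the Gensim 4.x parameter name
# (it was called size in Gensim 3.x)
word2vec = Word2Vec(
    sentences=text8_corpus,
    sg=1,             # skip-gram (0 would be CBOW)
    vector_size=100,  # embedding dimensionality (assumed)
    window=5,         # context window size (assumed)
    min_count=5,      # ignore words rarer than this (assumed)
    workers=4)        # training threads

# Sanity check: nearest neighbors of a common word
pprint.pprint(word2vec.wv.most_similar('queen'))

Because text8 is a much larger corpus than the toy text used earlier, training takes noticeably longer, but the nearest-neighbor queries should return far more sensible results.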