August 2018
Intermediate to advanced
438 pages
12h 3m
English
The gensim framework, created by Radim Rehurek, consists of a robust, efficient, and scalable implementation of the Word2vec model (https://radimrehurek.com/gensim/models/word2vec.html). It allows us to chose either one of the skip-gram or CBOW models. Let's try to learn and visualize the word embedding for the IMDB corpora. As discussed before, this has 50,000 labeled documents and 50,000 unlabeled documents. For learning word representations, we don't need any labels and hence we can use all of the available 100,000 documents.
Let's first load the full corpora. The downloaded documents are divided into train, test, and unsup folders:
def load_imdb_data(directory = 'train', datafile = None): ''' Parse IMDB review ...