December 2018
Beginner to intermediate
684 pages
21h 9m
English
The Doc2Vec class in the gensim.models.doc2vec module processes documents in the TaggedDocument format, which pairs each tokenized document with a unique tag that permits accessing the document vectors after training:
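from gensim.models.doc2vec import TaggedDocument

sentences = []
for i, (_, text) in enumerate(sample.values):
    # tag each tokenized document with its row position
    sentences.append(TaggedDocument(words=text.split(), tags=[i]))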
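To make the resulting structure concrete, here is a minimal sketch on a small hypothetical sample. gensim defines TaggedDocument as a namedtuple with the fields words and tags; the stand-in below mirrors that definition so the snippet runs without gensim installed. The rows in sample_rows are invented for illustration and only assume that each row is a (label, text) pair, as in the loop above.

```python
from collections import namedtuple

# Stand-in mirroring gensim's TaggedDocument namedtuple (fields: words, tags);
# with gensim installed, import it from gensim.models.doc2vec instead.
TaggedDocument = namedtuple('TaggedDocument', 'words tags')

# Hypothetical (label, text) rows, standing in for sample.values.
sample_rows = [
    (5, 'great food and friendly staff'),
    (1, 'slow service and cold food'),
]

# Same construction as in the text: whitespace tokenization, row index as tag.
sentences = [
    TaggedDocument(words=text.split(), tags=[i])
    for i, (_, text) in enumerate(sample_rows)
]

for doc in sentences:
    print(doc.tags, doc.words)
```

Each tag is wrapped in a list because a document may carry several tags; using the row index keeps the tag unique, so it can later serve as the lookup key for that document's vector.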
The training interface works similarly to word2vec, with additional parameters to specify the Doc2Vec algorithm:
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(documents=sentences,
                dm=1,             # algorithm: use distributed memory
                dm_concat=0,      # 1: concat, not sum/avg context vectors
                dbow_words=0,     # 1: train word vectors, 0: only doc vectors
                alpha=0.025,      # initial learning rate
                vector_size=300,  # named `size` in older gensim releases
                window=5,
                min_count=10,
                epochs=5,
                negative=5)
model.save('test.model')
...
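The dm_concat comment refers to how the distributed memory model combines the document vector with the context word vectors before predicting the target word. The toy sketch below illustrates only that combination step, in plain Python with made-up three-dimensional vectors. The function name combine_context and its arguments are hypothetical, and this is not gensim's implementation.

```python
def combine_context(doc_vec, word_vecs, concat=False, mean=True):
    """Combine one document vector with its context word vectors.

    concat=True  -> join all vectors end to end (input grows with the window)
    concat=False -> element-wise sum, or average when mean=True
    """
    vectors = [doc_vec] + word_vecs
    if concat:
        return [x for vec in vectors for x in vec]
    summed = [sum(xs) for xs in zip(*vectors)]
    if mean:
        return [s / len(vectors) for s in summed]
    return summed


# Made-up vectors: one document vector and two context word vectors.
doc_vec = [1.0, 0.0, 2.0]
word_vecs = [[0.0, 1.0, 1.0], [2.0, 2.0, 0.0]]

averaged = combine_context(doc_vec, word_vecs, concat=False)
concatenated = combine_context(doc_vec, word_vecs, concat=True)

print(len(averaged))      # same length as a single vector
print(len(concatenated))  # (1 + number of context words) * vector length
```

Summing or averaging keeps the input the size of a single vector regardless of the window, whereas concatenation preserves the position of each context word at the cost of a much larger input and therefore a larger, slower model.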