A simple extension of the Word2vec model to the document level was proposed by Mikolov et al. To obtain document vectors, this method adds a unique document ID to the start of each document. The ID is trained together with the words in the document, and the resulting vectors are averaged (or concatenated) to produce a document embedding. Hence, for the example that we discussed earlier, the doc2vec model data would look as follows:
- TensorFlow is an open source software library
- Python is an open source interpreted software programming language
In contrast to the earlier approach, each document's token list now begins with its document ID:
- [DOC_01, TensorFlow, is, an, open, source, software, library]
- [DOC_02, Python, is, an, open, source, ...