Older releases of scikit-learn do not implement latent Dirichlet allocation (it was added later as sklearn.decomposition.LatentDirichletAllocation). Therefore, we are going to use the gensim package for Python. Gensim was developed by Radim Řehůřek, who is a machine learning researcher and consultant in the United Kingdom.
As input data, we are going to use a collection of news reports from the Associated Press (AP). This is a standard dataset for text modeling research and was used in some of the earliest work on topic models. After downloading the data, we can load it by running the following code:
    import gensim
    from gensim import corpora, models
    corpus = corpora.BleiCorpus('./data/ap/ap.dat', './data/ap/vocab.txt')
The corpus variable holds all of the text documents and has loaded ...
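To make the shape of this data more concrete, here is a minimal sketch of the sparse, line-oriented layout used by Blei-style (LDA-C) corpus files, which is the format that BleiCorpus reads. It uses only the standard library. The helper name read_ldac, the toy file names, and the toy vocabulary are made up for illustration and are not part of gensim or of the AP dataset. In practice gensim does this parsing for us.

```python
import os
import tempfile


def read_ldac(data_path, vocab_path):
    """Parse a Blei-style (LDA-C) corpus into a list of sparse documents.

    Hypothetical helper for illustration only; gensim's BleiCorpus
    is what the text actually uses.
    """
    # The vocabulary file has one word per line; the line index is the word id.
    with open(vocab_path, encoding="utf-8") as f:
        vocab = [line.strip() for line in f]

    docs = []
    with open(data_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            # First field: number of distinct terms in the document.
            n_terms = int(fields[0])
            # Remaining fields: "word_id:count" pairs.
            doc = []
            for pair in fields[1:]:
                word_id, count = pair.split(":")
                doc.append((int(word_id), int(count)))
            if len(doc) != n_terms:
                raise ValueError("term count does not match number of pairs")
            docs.append(doc)
    return docs, vocab


# Toy data (an assumption, not the AP corpus): two short "documents".
tmpdir = tempfile.mkdtemp()
data_path = os.path.join(tmpdir, "toy.dat")
vocab_path = os.path.join(tmpdir, "vocab.txt")
with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("market\nstocks\npresident\nelection\n")
with open(data_path, "w", encoding="utf-8") as f:
    f.write("2 0:3 1:1\n3 1:2 2:1 3:4\n")

docs, vocab = read_ldac(data_path, vocab_path)
for doc in docs:
    # Map word ids back to words to show what each document contains.
    print([(vocab[word_id], count) for word_id, count in doc])
```

Each document is stored as a list of (word id, count) pairs, so only the words that actually occur in a document take up space. Iterating over a gensim corpus yields documents in a similar sparse form.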