July 2018
Although scikit-learn includes a LatentDirichletAllocation estimator, we are going to use the gensim package for Python, which is built specifically for topic modeling. Gensim was developed by Radim Řehůřek, who is a machine learning researcher and consultant in the United Kingdom.
As input data, we are going to use a collection of news reports from the Associated Press (AP). This is a standard dataset for text modeling research and was used in some of the earliest work on topic models. After downloading the data, we can load it by running the following code:
import gensim
from gensim import corpora, models
corpus = corpora.BleiCorpus('./data/ap/ap.dat', './data/ap/vocab.txt')
The corpus variable holds all of the text documents and has loaded ...
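To make the file format less opaque, here is a minimal sketch of what BleiCorpus reads, assuming the data follows Blei's LDA-C layout: each line of the data file describes one document as a count of distinct terms followed by id:count fields, and the vocabulary file lists one word per line, so a word's line position is its id. The sample line, the tiny vocabulary, and the helper name parse_ldac_line are made up for illustration; they are not taken from the AP data or from gensim.

```python
# Sketch: parse one line of the LDA-C format by hand (no gensim needed).
# The sample line and vocabulary are hypothetical, for illustration only.

def parse_ldac_line(line):
    """Return a list of (term_id, count) pairs for one document line."""
    fields = line.split()
    n_unique = int(fields[0])  # first field: number of distinct terms
    pairs = []
    for field in fields[1:]:
        term_id, count = field.split(":")
        pairs.append((int(term_id), int(count)))
    if len(pairs) != n_unique:
        raise ValueError("declared term count does not match the data")
    return pairs


# Hypothetical vocabulary: position in the list plays the role of the line
# number in vocab.txt, which serves as the term id.
vocab = ["police", "court", "market", "stocks"]

# Hypothetical document line: 3 distinct terms, then id:count fields.
sample_line = "3 0:2 1:1 3:4"

doc = parse_ldac_line(sample_line)
readable = [(vocab[term_id], count) for term_id, count in doc]

print(doc)
print(readable)
```

Iterating over the gensim corpus object yields documents in a similar sparse form, one list of (term id, count) pairs per document, which is the representation the topic model consumes.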