Building a topic model

Although scikit-learn includes an implementation of latent Dirichlet allocation (sklearn.decomposition.LatentDirichletAllocation), we are going to use gensim, a Python package dedicated to topic modeling. Gensim was developed by Radim Řehůřek, who is a machine learning researcher and consultant in the United Kingdom.

As input data, we are going to use a collection of news reports from the Associated Press (AP). This is a standard dataset for text modeling research and was used in some of the earliest work on topic models. After downloading the data, we can load it by running the following code:

import gensim
from gensim import corpora, models

# Load the AP data, stored in Blei's LDA-C format (word counts plus vocabulary)
corpus = corpora.BleiCorpus('./data/ap/ap.dat', './data/ap/vocab.txt')

The corpus variable holds all of the text documents and has loaded ...
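
To make the input format concrete: the two files passed to BleiCorpus follow the plain-text layout used by Blei's LDA-C software. Each line of the data file describes one document as the number of unique terms followed by id:count pairs, and the vocabulary file lists one word per line, with the line position serving as the word id. The following is a minimal sketch of that layout in pure Python, not gensim's own parser. The function read_ldac and the toy files are hypothetical and made up for illustration; they are not part of gensim or of the AP dataset.

```python
from pathlib import Path
import tempfile


def read_ldac(data_path, vocab_path):
    """Parse an LDA-C style corpus (hypothetical helper, not a gensim API).

    Returns (vocab, documents), where each document is a list of
    (word_id, count) pairs.
    """
    # One word per line; the line index is the word id.
    vocab = Path(vocab_path).read_text(encoding="utf-8").split()

    documents = []
    for line in Path(data_path).read_text(encoding="utf-8").splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        n_unique = int(fields[0])  # declared number of unique terms
        pairs = []
        for field in fields[1:]:
            word_id, count = field.split(":")
            pairs.append((int(word_id), int(count)))
        if len(pairs) != n_unique:
            raise ValueError("term count does not match header: " + line)
        documents.append(pairs)
    return vocab, documents


# Toy data invented for this sketch (NOT taken from the AP corpus).
with tempfile.TemporaryDirectory() as tmp:
    vocab_file = Path(tmp) / "toy_vocab.txt"
    data_file = Path(tmp) / "toy.dat"
    vocab_file.write_text("market\nstocks\nelection\nvote\n", encoding="utf-8")
    data_file.write_text("2 0:3 1:1\n3 1:2 2:1 3:4\n", encoding="utf-8")
    vocab, documents = read_ldac(data_file, vocab_file)

# Show each document as (word, count) pairs instead of (word_id, count).
for doc in documents:
    print([(vocab[word_id], count) for word_id, count in doc])
```

gensim exposes documents in a similar sparse form, as lists of (word_id, count) pairs, so looking at a hand-made file like this can help when checking that the real data was downloaded and unpacked correctly.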
