Chapter 8. Unsupervised Methods: Topic Modeling and Clustering

When working with a large number of documents, one of the first questions you want to ask without reading all of them is “What are they talking about?” You are interested in the general topics of the documents, that is, in which (ideally meaningful) words are frequently used together.

Topic modeling tries to answer that question by using statistical techniques to discover topics in a corpus of documents. Depending on your vectorization (see Chapter 5), you might find different kinds of topics. Each topic is a probability distribution over features (words, n-grams, etc.).

Topics normally overlap; they are not clearly separated. The same is true for documents: a document cannot be assigned uniquely to a single topic, because it always contains a mixture of topics. The aim of topic modeling is therefore not primarily to assign a topic to an arbitrary document but to find the global structure of the corpus.

Often, a set of documents already has an explicit structure given by categories, keywords, and so on. If we instead want to look at the organic composition of the corpus, topic modeling helps greatly to uncover its latent structure.

Topic modeling has been known for a long time and has gained immense popularity during the last 15 years, mainly due to the invention of latent Dirichlet allocation (LDA), a stochastic method for discovering topics. LDA is flexible and allows many modifications. However, it is not the ...
