The topic discovery pattern

The topic discovery design pattern explores one way of classifying a corpus of text: Latent Dirichlet Allocation (LDA), implemented using Pig and Mahout.
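To give a feel for what LDA computes, here is a minimal single-machine sketch in Python. It is not the Pig and Mahout workflow this pattern uses. It assumes collapsed Gibbs sampling as the inference method, which is one common choice; the function name `lda_gibbs`, the hyperparameter defaults, and the toy documents are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA over tokenized documents.

    docs: list of documents, each a list of word strings.
    Returns (theta, top_words): per-document topic proportions and the
    most frequent words assigned to each topic.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Weight of each topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic proportions for each document.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    # Words with the highest counts in each topic.
    top_words = [
        sorted((w for w, c in tw.items() if c > 0), key=lambda w: -tw[w])[:3]
        for tw in topic_word
    ]
    return theta, top_words


# Toy corpus (hypothetical): two loose themes, sports and cooking.
docs = [
    "goal match team goal".split(),
    "team match score".split(),
    "recipe oven bake recipe".split(),
    "bake oven flour".split(),
]
theta, top_words = lda_gibbs(docs, num_topics=2)
print(theta)
print(top_words)
```

Each row of `theta` is one document's estimated mix of topics, and `top_words` lists the words most often assigned to each topic. On a real corpus the same counts would be computed in a distributed fashion rather than in nested Python loops.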

Background

Discovering the hidden topics in a corpus of text is one of the more recent developments in the field of natural language processing. The data posted on social media sites generally covers a wide array of subjects. To extract relevant information from these sites, we have to classify the text corpus according to the topics hidden in the text. This enables automated summarization of a large amount of text and reveals what it is really about. Prior knowledge of the topics that are thus discovered is used to classify new ...
