Apache Mahout Clustering Designs by Ashish Gupta

Running LDA using Mahout

To run LDA using Mahout, we will use the 20newsgroups dataset. We will convert the corpus into vectors, run LDA on those vectors, and inspect the resulting topics.

Let's run this example to view how topic modeling works in Mahout.

Dataset selection

We will use the 20newsgroups dataset for this exercise. Download the archive 20news-bydate.tar.gz from http://qwone.com/~jason/20Newsgroups/.

Steps to execute CVB (LDA)

  1. Create a directory 20newsdata and extract the archive into it:
    mkdir /tmp/20newsdata
    cd /tmp/20newsdata
    tar -xzvf /tmp/20news-bydate.tar.gz
    
  2. There are now two folders under 20newsdata: 20news-bydate-test and 20news-bydate-train. Create another directory, 20newsdataall, and merge the training and test data of each newsgroup into it.
  3. Now, move to the ...
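
The merge in step 2 is described without commands. A minimal sketch of one way to do it follows. It assumes the archive was extracted under /tmp/20newsdata as in step 1 and that the merged directory lives at /tmp/20newsdataall (the text names the directory but not its location). The function name merge_splits is hypothetical, not part of Mahout.

```shell
#!/bin/sh
# merge_splits SRC DEST
# Copies the per-newsgroup folders of both splits found under SRC into DEST,
# so that each newsgroup ends up as one folder holding train and test posts.
# (Hypothetical helper name; the paths below are assumptions, see lead-in.)
merge_splits() {
  src="$1"
  dest="$2"
  mkdir -p "$dest"
  for split in 20news-bydate-train 20news-bydate-test; do
    # Skip a split that has not been extracted.
    [ -d "$src/$split" ] || continue
    # "dir/." copies the contents of dir; same-named newsgroup folders
    # from the two splits are merged rather than nested.
    cp -R "$src/$split"/. "$dest"/
  done
}

merge_splits /tmp/20newsdata /tmp/20newsdataall
```

Afterwards, `ls /tmp/20newsdataall` should list one folder per newsgroup. If a file with the same name existed in both splits, the copy from the test split would overwrite the one from the training split, so it is worth comparing file counts before and after the merge.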
