To run LDA using Mahout, we will use the 20newsgroups dataset. We will convert the corpus into vectors, run LDA on those vectors, and get the resultant topics.
Let's run this example to view how topic modeling works in Mahout.
Download the dataset archive 20news-bydate.tar.gz from http://qwone.com/~jason/20Newsgroups/ and save it in /tmp.
Create a directory named 20newsdata and unzip the data here:
mkdir /tmp/20newsdata
cd /tmp/20newsdata
tar -xzvf /tmp/20news-bydate.tar.gz
You will see two folders under 20newsdata: 20news-bydate-test and 20news-bydate-train. Now, create another directory named 20newsdataall and merge the training and test data of each newsgroup into it.
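The merge step can be sketched as a small shell script. This is a minimal sketch, not the only way to do it: the helper name merge_splits is hypothetical, and the /tmp/20newsdataall location is an assumption based on the /tmp paths used earlier. It relies on each split containing one subdirectory per newsgroup, so copying both splits into the same target merges same-named newsgroup directories.

```shell
#!/bin/sh
# Sketch: merge the train and test splits into one directory tree.
# merge_splits is a hypothetical helper name, not part of Mahout.
merge_splits() {
    src="$1"   # directory holding 20news-bydate-train and 20news-bydate-test
    dst="$2"   # merged output directory
    mkdir -p "$dst"
    # Copy the contents of each split into the same target; newsgroup
    # directories with the same name end up merged.
    for split in "$src"/20news-bydate-train "$src"/20news-bydate-test; do
        cp -R "$split"/. "$dst"/
    done
}

# Assumed locations; run only when the extracted data is present.
if [ -d /tmp/20newsdata/20news-bydate-train ]; then
    merge_splits /tmp/20newsdata /tmp/20newsdataall
fi
```

After it runs, /tmp/20newsdataall should contain one directory per newsgroup, each holding the documents from both splits.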