Clustering text
Text clustering groups similar documents based on the words they contain, and it has many applications. One of the most common examples is grouping news articles by topic. In this section, we will discuss how to implement text clustering using Mahout.
The dataset
We will be using the Reuters dataset for the clustering example. This dataset is a collection of Reuters news articles distributed as SGML files. We will download the dataset and then extract it using tar to the reuters-sgm folder. Move to the directory data/chapter10 and execute the following commands:
export MAHOUT_LOCAL=TRUE
curl http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz -o reuters21578.tar.gz
mkdir -p reuters-sgm
tar xzf reuters21578.tar.gz -C reuters-sgm
...
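As a quick sanity check after extraction, the following minimal sketch counts the articles in the extracted files. It is not part of the original command listing: the function name count_reuters_docs is hypothetical, and the sketch assumes that the archive unpacks into .sgm files in which each article begins on its own line with a <REUTERS tag.

```shell
# Hypothetical helper (not from the book): count the articles found in
# the .sgm files of a directory. Assumes each article starts on its own
# line with an opening "<REUTERS" tag.
count_reuters_docs() {
    dir="${1:-reuters-sgm}"   # default to the folder used in the commands above
    # Concatenate all .sgm files and count the lines holding an opening tag.
    # LC_ALL=C makes grep treat the input as plain bytes.
    cat "$dir"/*.sgm 2>/dev/null | LC_ALL=C grep -c '<REUTERS'
}

# Example usage, run from data/chapter10:
#   count_reuters_docs reuters-sgm
```

If the count is zero, the download or the tar step most likely did not complete, and it is worth rerunning those commands before moving on.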