Clustering using the KMeans algorithm with MLlib

In this recipe, we will demonstrate how to cluster unlabeled data points using the KMeans algorithm with MLlib. As discussed in the introduction of this chapter, MLlib is the machine learning component of Apache Spark and is a competitive, arguably stronger, alternative to Apache Mahout.
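
Before turning to MLlib, it may help to see what KMeans does to unlabeled points. The following is a minimal plain-Java sketch of the standard assign-and-update loop (Lloyd's iterations); it is not MLlib's implementation and does not use Spark. The class name KMeansSketch, its method names, and the six sample points are all hypothetical, chosen only for illustration, and seeding the centroids with the first k points is a simplification.

```java
import java.util.Arrays;

// Illustrative sketch only: plain-Java k-means, not the MLlib implementation.
class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c])
                    < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Runs the assign/update loop and returns one cluster index per point.
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;

        // Assumption: seed the centroids with the first k points (a simplification).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }

        int[] labels = new int[points.length];
        Arrays.fill(labels, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int best = nearest(points[p], centroids);
                if (best != labels[p]) {
                    labels[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // assignments are stable, so the loop has converged
            }

            // Update step: move each centroid to the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[labels[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[labels[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Made-up 3-dimensional points that form two obvious groups.
        double[][] points = {
            {0.0, 0.0, 0.0}, {0.1, 0.1, 0.1}, {0.2, 0.2, 0.2},
            {9.0, 9.0, 9.0}, {9.1, 9.1, 9.1}, {9.2, 9.2, 9.2}
        };
        System.out.println(Arrays.toString(cluster(points, 2, 20)));
    }
}
```

With these points and k = 2, the sketch prints [0, 0, 0, 1, 1, 1]: the three points near the origin share one cluster and the three points near 9 share the other, even though no labels were supplied.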

Getting ready

  1. You will be using the Maven project you created in the previous recipe (solving simple text mining problems with Apache Spark). If you have not done so yet, follow steps 1-6 in the Getting ready section of that recipe.
  2. Go to https://github.com/apache/spark/blob/master/data/mllib/kmeans_data.txt, download the data, and save it as km-data.txt in the data folder of the project that you created by following the ...
