Clustering using the KMeans algorithm with MLlib

In this recipe, we will demonstrate how you can cluster unlabeled data points using the KMeans algorithm with MLlib. As discussed in the introduction of this chapter, MLlib is the machine learning component of Apache Spark and is a competitive, arguably stronger, alternative to Apache Mahout.
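
Before turning to MLlib, it may help to see what KMeans does to unlabeled points. The following is a minimal plain-Java sketch of the basic iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). It does not use Spark or the MLlib API; the class name KMeansSketch, its method names, the sample points, and the choice of the first k points as initial centroids are all assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Spark or MLlib.
class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c])
                    < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Repeats the assign/update steps until the centroids stop moving
    // or maxIterations is reached, then returns the centroids.
    // Assumption: the first k points are used as initial centroids,
    // a simplification compared with the seeding real libraries use.
    static double[][] fit(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: add each point to its nearest cluster.
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            // Update step: move each centroid to the mean of its points.
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // leave the centroid of an empty cluster as is
                }
                for (int i = 0; i < dim; i++) {
                    double updated = sums[c][i] / counts[c];
                    if (updated != centroids[c][i]) {
                        changed = true;
                    }
                    centroids[c][i] = updated;
                }
            }
            if (!changed) {
                break;
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Illustrative three-dimensional points forming two obvious groups.
        double[][] points = {
            {0.0, 0.0, 0.0}, {0.1, 0.1, 0.1}, {0.2, 0.2, 0.2},
            {9.0, 9.0, 9.0}, {9.1, 9.1, 9.1}, {9.2, 9.2, 9.2}
        };
        double[][] centroids = fit(points, 2, 20);
        for (double[] p : points) {
            System.out.println(Arrays.toString(p)
                    + " -> cluster " + nearest(p, centroids));
        }
    }
}
```

No labels are supplied anywhere in the sketch: the grouping emerges only from the distances between points, which is the property this recipe relies on.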

Getting ready

  1. You will be using the Maven project you created in the previous recipe (solving simple text mining problems with Apache Spark). If you have not done so yet, follow steps 1-6 in the Getting ready section of that recipe.
  2. Go to https://github.com/apache/spark/blob/master/data/mllib/kmeans_data.txt, download the data, and save it as km-data.txt in the data folder of the project that you created by following the ...
