Clustering using the KMeans algorithm with MLlib
In this recipe, we will demonstrate how you can cluster unlabeled data points using the KMeans algorithm with MLlib. As discussed in the introduction of this chapter, MLlib is the machine learning component of Apache Spark and is a competitive, arguably better, alternative to Apache Mahout.
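Before turning to Spark, it can help to see what KMeans does on a handful of points. The following is a minimal plain-Java sketch of the standard assign-and-update loop (Lloyd's algorithm); it is not MLlib's implementation and does not use Spark. The class name `KMeansSketch`, the sample points, and the choice to seed the centroids with the first k points are all assumptions made for illustration.

```java
import java.util.Arrays;

public class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns, for each point, the index of the cluster it ends up in.
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;

        // Assumption: seed the centroids with the first k points.
        // This keeps the sketch deterministic; real libraries use smarter seeding.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }

        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;

            // Assignment step: attach every point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[p], centroids[c])
                            < squaredDistance(points[p], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }

            // Stop as soon as no point switches cluster.
            if (!changed) {
                break;
            }

            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // an empty cluster keeps its old centroid
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: two well-separated groups of 3D points.
        double[][] points = {
            {0.0, 0.0, 0.0}, {0.1, 0.1, 0.1}, {0.2, 0.2, 0.2},
            {9.0, 9.0, 9.0}, {9.1, 9.1, 9.1}, {9.2, 9.2, 9.2}
        };
        System.out.println(Arrays.toString(cluster(points, 2, 20)));
    }
}
```

With these sample points and k = 2, the first three points end up in one cluster and the last three in the other. MLlib performs the same kind of iteration, but distributed over a Spark RDD and with a more robust initialization.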
- You will be using the Maven project you created in the previous recipe (solving simple text mining problems with Apache Spark). If you have not done so yet, follow steps 1-6 in the Getting ready section of that recipe.
- Go to https://github.com/apache/spark/blob/master/data/mllib/kmeans_data.txt, download the data, and save it as km-data.txt in the data folder of the project that you created by following the ...