Clustering is a form of unsupervised learning in which the algorithm's task is to find structure in a given dataset. In particular, a notion of similarity or distance between instances of the dataset is used to group them into clusters. Spark provides K-Means, Gaussian mixture models fitted with expectation-maximization (EM), power iteration clustering (PIC), Latent Dirichlet Allocation (LDA), and streaming K-Means.
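To make the notion of distance or similarity concrete, here is a minimal sketch in plain Scala, with no Spark dependency. The names `Distances`, `euclidean`, and `cosineSimilarity` are illustrative and do not come from the book or from Spark; Euclidean distance and cosine similarity are only two of many possible measures.

```scala
object Distances {
  // An instance of the dataset, represented as a dense feature vector.
  type Point = Vector[Double]

  // Euclidean distance: smaller values mean the instances are closer.
  def euclidean(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  }

  // Cosine similarity: larger values mean the instances point in a
  // more similar direction. Assumes neither vector is all zeros.
  def cosineSimilarity(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }
}
```

For example, `Distances.euclidean(Vector(0.0, 0.0), Vector(3.0, 4.0))` evaluates to `5.0`. A clustering algorithm would typically place instances in the same cluster when their distance is small or their similarity is high.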


K-Means is one of the most popular clustering algorithms; its parameter K, the number of clusters, is chosen in advance. More formally, K-Means is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), each represented by its centroid. In ...
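The idea of clusters represented by centroids can be sketched with the classic iterative procedure (often called Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The code below is a plain-Scala illustration, not Spark's implementation; `KMeansSketch`, `fit`, and the other names are hypothetical, and the caller supplies the initial centroids, whereas real implementations usually pick them randomly or with a smarter seeding scheme.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance; the square root is not needed for comparisons.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid nearest to p.
  def closest(p: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(i => squaredDistance(p, centroids(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / points.size).toVector

  // Runs assignment and update steps until the centroids stop changing
  // or maxIterations is reached. K is the number of initial centroids.
  def fit(points: Seq[Point], initial: Seq[Point], maxIterations: Int = 20): Seq[Point] = {
    var centroids = initial
    var changed = true
    var iteration = 0
    while (changed && iteration < maxIterations) {
      // Assignment step: group points by their nearest centroid.
      val groups = points.groupBy(p => closest(p, centroids))
      // Update step: an empty cluster keeps its previous centroid.
      val updated = centroids.indices.map { i =>
        groups.get(i).map(mean).getOrElse(centroids(i))
      }
      changed = updated != centroids
      centroids = updated
      iteration += 1
    }
    centroids
  }
}
```

With K = 2 and four points forming two obvious groups:

```scala
val points = Seq(Vector(1.0, 1.0), Vector(1.0, 2.0), Vector(9.0, 9.0), Vector(9.0, 10.0))
val centroids = KMeansSketch.fit(points, initial = Seq(points(0), points(2)))
// centroids == Seq(Vector(1.0, 1.5), Vector(9.0, 9.5))
```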
