Clustering with k-means and Incanter

Finally, having tokenized, stemmed, and vectorized our input documents—and with a selection of distance measures to choose from—we're in a position to run clustering on our data. The first clustering algorithm we'll look at is called k-means clustering.

k-means is an iterative algorithm that proceeds as follows:

  1. Randomly pick k cluster centroids.
  2. Assign each of the data points to the cluster with the closest centroid.
  3. Adjust each cluster centroid to the mean of its assigned data points.
  4. Repeat steps 2 and 3 until convergence (the assignments stop changing) or until the maximum number of iterations is reached.
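
The four steps above can be sketched in a few lines of code. The following is a minimal illustration in plain Python, not the Clojure and Incanter implementation this chapter builds towards. It rests on several assumptions: points are tuples of numbers, distance is squared Euclidean (any of the distance measures discussed earlier could be substituted), the initial centroids are drawn at random from the data points themselves, and the names k_means, squared_distance, and mean_point are hypothetical helpers invented for this sketch.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    """Return (centroids, assignments) for a list of point tuples."""
    rng = random.Random(seed)
    # Step 1: randomly pick k data points as the initial centroids
    # (an assumption; other initialization schemes exist).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each point to the index of its closest centroid.
        new_assignments = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Step 4: stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its assigned points.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # leave a centroid in place if its cluster is empty
                centroids[i] = mean_point(members)
    return centroids, assignments


# Two well-separated groups of points, clustered with k=2.
points = [(1.0, 1.0), (2.0, 2.0), (10.0, 10.0), (11.0, 11.0)]
centroids, assignments = k_means(points, k=2)
print(sorted(centroids))  # [(1.5, 1.5), (10.5, 10.5)]
```

The centroids are sorted before printing because which cluster ends up with index 0 depends on the random initialization. On less cleanly separated data, different starting centroids can also lead to different final clusterings, which is why the sketch takes a seed parameter.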

The process is visualized in the following diagram for k=3 clusters:

In the preceding figure, we can see that the initial cluster centroids at iteration 1 don't ...
