Clojure for Data Science by Henry Garner

Clustering with k-means and Incanter

Finally, having tokenized, stemmed, and vectorized our input documents—and with a selection of distance measures to choose from—we're in a position to run clustering on our data. The first clustering algorithm we'll look at is called k-means clustering.

k-means is an iterative algorithm that proceeds as follows:

  1. Randomly pick k cluster centroids.
  2. Assign each of the data points to the cluster with the closest centroid.
  3. Adjust each cluster centroid to the mean of its assigned data points.
  4. Repeat steps 2 and 3 until convergence or until the maximum number of iterations is reached.
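
The four steps above can be sketched in plain Clojure. This is only an illustrative sketch, not Incanter's implementation or the book's own code, which this excerpt does not show. It assumes points are vectors of numbers and uses squared Euclidean distance, although any of the distance measures could be substituted. The function names (`squared-distance`, `closest-centroid`, `mean-point`, `update-centroids`, `k-means`) are hypothetical. The starting centroids are passed in rather than picked at random so that the example is repeatable.

```clojure
;; Illustrative sketch only: all function names here are hypothetical.

(defn squared-distance
  "Squared Euclidean distance between two equal-length numeric vectors.
  Assumption: Euclidean distance is used; other measures could be swapped in."
  [a b]
  (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))

(defn closest-centroid
  "Step 2: index of the centroid nearest to point."
  [centroids point]
  (apply min-key
         #(squared-distance (nth centroids %) point)
         (range (count centroids))))

(defn mean-point
  "Component-wise mean of a non-empty collection of points."
  [points]
  (let [n (double (count points))]
    (mapv #(/ % n) (apply map + points))))

(defn update-centroids
  "Step 3: move each centroid to the mean of its assigned points.
  A centroid with no assigned points is left where it is."
  [centroids points]
  (let [clusters (group-by #(closest-centroid centroids %) points)]
    (vec (map-indexed (fn [i centroid]
                        (if-let [members (get clusters i)]
                          (mean-point members)
                          centroid))
                      centroids))))

(defn k-means
  "Step 4: repeat assignment and update until the centroids stop moving
  or max-iterations is reached. Always performs at least one iteration."
  [points initial-centroids max-iterations]
  (loop [centroids (vec initial-centroids)
         i 0]
    (let [updated (update-centroids centroids points)]
      (if (or (= updated centroids)
              (>= (inc i) max-iterations))
        updated
        (recur updated (inc i))))))

;; Step 1 would normally pick the starting centroids at random,
;; for example with (take 2 (shuffle points)).
(def points
  [[1.0 1.0] [1.0 2.0] [2.0 1.0]
   [8.0 8.0] [9.0 8.0] [8.0 9.0]])

(def final-centroids
  (k-means points [[0.0 0.0] [10.0 10.0]] 10))
```

With these six points and two well-separated starting centroids, the centroids settle at roughly (1.33, 1.33) and (8.33, 8.33), the means of the two visible groups. Comparing the updated centroids to the previous ones with `=` is the simplest convergence check. A tolerance on how far the centroids moved is a common alternative when working with floating-point data.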

The process is visualized in the following diagram for k=3 clusters:

In the preceding figure, we can see that the initial cluster centroids at iteration 1 don't ...
