Clustering with k-means and Incanter
Finally, having tokenized, stemmed, and vectorized our input documents—and with a selection of distance measures to choose from—we're in a position to run clustering on our data. The first clustering algorithm we'll look at is called k-means clustering.
k-means is an iterative algorithm that proceeds as follows:
- Randomly pick k cluster centroids.
- Assign each of the data points to the cluster with the closest centroid.
- Adjust each cluster centroid to the mean of its assigned data points.
- Repeat the assignment and adjustment steps until convergence (the centroids stop moving) or until the maximum number of iterations is reached.
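The steps above can be sketched in plain Clojure. This is a minimal illustration rather than Incanter's own implementation: the function names (squared-distance, closest-centroid, mean-point, update-centroids, random-centroids, and k-means) are hypothetical, points are assumed to be equal-length numeric vectors, and squared Euclidean distance is assumed as the distance measure, although any of the measures discussed earlier could be substituted.

```clojure
(defn squared-distance
  "Squared Euclidean distance between two equal-length numeric vectors.
  Assumed here for illustration; another distance measure could be swapped in."
  [a b]
  (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))

(defn closest-centroid
  "Returns the index of the centroid nearest to point."
  [centroids point]
  (apply min-key
         #(squared-distance point (nth centroids %))
         (range (count centroids))))

(defn mean-point
  "Component-wise mean of a non-empty collection of points."
  [points]
  (let [n (count points)]
    (mapv #(/ % n) (apply map + points))))

(defn update-centroids
  "Assigns each point to its closest centroid, then moves each centroid
  to the mean of its assigned points. A centroid with no assigned points
  is left where it is."
  [centroids points]
  (let [clusters (group-by #(closest-centroid centroids %) points)]
    (vec (map-indexed (fn [i centroid]
                        (if-let [members (get clusters i)]
                          (mean-point members)
                          centroid))
                      centroids))))

(defn random-centroids
  "Picks k of the points at random to serve as the initial centroids."
  [k points]
  (vec (take k (shuffle points))))

(defn k-means
  "Repeats the assign-and-adjust step until the centroids stop changing
  or max-iterations is reached. Returns the final centroids."
  [initial-centroids points max-iterations]
  (loop [centroids (vec initial-centroids)
         iteration 0]
    (if (>= iteration max-iterations)
      centroids
      (let [updated (update-centroids centroids points)]
        (if (= updated centroids)
          centroids
          (recur updated (inc iteration)))))))
```

A typical call would be (k-means (random-centroids 3 points) points 100). Passing the initial centroids in as an argument, rather than choosing them inside k-means, keeps the iteration itself deterministic and makes it easy to compare the effect of different starting positions.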
The process is visualized in the following diagram for k=3 clusters:
In the preceding figure, we can see that the initial cluster centroids at iteration 1 don't ...