Finally, having tokenized, stemmed, and vectorized our input documents—and with a selection of distance measures to choose from—we're in a position to run clustering on our data. The first clustering algorithm we'll look at is called *k-means clustering*.

*k*-means is an iterative algorithm that proceeds as follows:

- Randomly pick *k* cluster centroids.
- Assign each of the data points to the cluster with the closest centroid.
- Adjust each cluster centroid to the mean of its assigned data points.
- Repeat until convergence or until the maximum number of iterations is reached.
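
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in this chapter. It makes several assumptions of its own: points are tuples of numbers, distance is squared Euclidean, the initial centroids are drawn from the data points themselves, and convergence means that no point changed cluster between iterations. The names `k_means`, `squared_distance`, and `mean_point` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    """Cluster `points` into `k` groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 1: randomly pick k centroids (assumption: sampled from the data).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each point to the cluster with the closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return centroids, assignments
```

Calling `k_means(points, 3)` on a list of coordinate tuples returns the final centroids along with a cluster index for each point. Because the starting centroids are random, different seeds can produce different clusterings on data that is not cleanly separated.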

The process is visualized in the following diagram for *k=3* clusters:

In the preceding figure, we can see that the initial cluster centroids at iteration 1 don't ...
