Finally, having tokenized, stemmed, and vectorized our input documents—and with a selection of distance measures to choose from—we're in a position to run clustering on our data. The first clustering algorithm we'll look at is called k-means clustering.
k-means is an iterative algorithm that proceeds as follows:
The process is visualized in the following diagram for k=3 clusters:
In the preceding figure, we can see that the initial cluster centroids at iteration 1 don't ...