December 2018
Beginner to intermediate
684 pages
21h 9m
English
k-Means is the most well-known clustering algorithm and was first proposed by Stuart Lloyd at Bell Labs in 1957.
The algorithm finds K centroids and assigns each data point to exactly one cluster with the goal of minimizing the within-cluster variance (called inertia). It typically uses Euclidean distance, but other metrics can also be used. k-Means assumes that clusters are spherical and of equal size, and ignores the covariance among features.
The problem is computationally difficult (np-hard) because there are KN ways to partition the N observations into K clusters. The standard iterative algorithm delivers a local optimum for a given K and proceeds as follows: