Chapter 47. In Depth: k-Means Clustering
In the previous chapters we explored unsupervised machine learning models for dimensionality reduction. Now we will move on to another class of unsupervised machine learning models: clustering algorithms. Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete labeling of groups of points.
Many clustering algorithms are available in Scikit-Learn and elsewhere, but perhaps the simplest to understand is an algorithm known as k-means clustering, which is implemented in sklearn.cluster.KMeans.
We begin with the standard imports:
In [1]: %matplotlib inline
        import matplotlib.pyplot as plt
        plt.style.use('seaborn-whitegrid')
        import numpy as np
Introducing k-Means
The k-means algorithm searches for a predetermined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like:
- The cluster center is the arithmetic mean of all the points belonging to the cluster.
- Each point is closer to its own cluster center than to other cluster centers.
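Both conditions can be checked numerically. The following is a minimal sketch using plain NumPy rather than Scikit-Learn; the toy array X and the assignment in labels are hypothetical values chosen for illustration, not data from this chapter.

```python
import numpy as np

# Hypothetical toy data: two tight groups of 2D points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])  # assumed cluster assignment

# Condition 1: each center is the arithmetic mean of its cluster's points.
centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# Condition 2: each point is closer to its own center than to any other.
# distances[i, k] is the Euclidean distance from point i to center k.
distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
nearest = distances.argmin(axis=1)

# True only if every point's nearest center is the one it was assigned to.
consistent = np.array_equal(nearest, labels)
print(centers)
print(consistent)
```

If consistent is True, the assignment satisfies both conditions at once: the centers are the cluster means, and no point would prefer a different center.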
Those two assumptions are the basis of the k-means model. We will soon dive into exactly how the algorithm reaches this solution, but for now let’s take a look at a simple dataset and see the k-means result.
First, let’s generate a two-dimensional dataset containing four distinct blobs. To emphasize that this is an unsupervised algorithm, we will leave the labels out of ...