Chapter 47. In Depth: k-Means Clustering
In the previous chapters we explored unsupervised machine learning models for dimensionality reduction. Now we will move on to another class of unsupervised machine learning models: clustering algorithms. Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete labeling of groups of points.
Many clustering algorithms are available in Scikit-Learn and elsewhere,
but perhaps the simplest to understand is an algorithm known as k-means
clustering, which is implemented in sklearn.cluster.KMeans.
We begin with the standard imports:
In [1]: %matplotlib inline
        import matplotlib.pyplot as plt
        plt.style.use('seaborn-whitegrid')
        import numpy as np
Introducing k-Means
The k-means algorithm searches for a predetermined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like:
- The cluster center is the arithmetic mean of all the points belonging to the cluster.
- Each point is closer to its own cluster center than to other cluster centers.
Those two assumptions are the basis of the k-means model. We will soon dive into exactly how the algorithm reaches this solution, but for now let’s take a look at a simple dataset and see the k-means result.
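As a quick preview of what such a result looks like in code, here is a minimal sketch that fits KMeans to a synthetic dataset. The make_blobs parameters here (300 points, four centers, a fixed random seed) are illustrative choices, not necessarily the exact dataset used elsewhere in the chapter:

```python
# A minimal sketch: fit k-means to synthetic blob data with scikit-learn.
# The dataset parameters below are illustrative assumptions.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate 300 points drawn from four Gaussian blobs
X, _ = make_blobs(n_samples=300, centers=4,
                  cluster_std=0.60, random_state=0)

# Ask k-means for four clusters; n_init controls restarts from
# different random initializations
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Each cluster center is a point in the same 2D space as the data
print(kmeans.cluster_centers_.shape)
```

Note that the true blob labels returned by make_blobs are deliberately discarded: the algorithm sees only the coordinates in X, consistent with the unsupervised setting.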
First, let’s generate a two-dimensional dataset containing four distinct blobs. To emphasize that this is an unsupervised algorithm, we will leave the labels out of ...