Chapter 18. Clustering

Clustering is an unsupervised machine learning technique used to divide a group into cohorts. It is unsupervised because we don’t give the model any labels; it just inspects the features and determines which samples are similar and belong in a cluster. In this chapter, we will look at the K-means and hierarchical clustering methods. We will also explore the Titanic dataset again using various techniques.

K-Means

The K-means algorithm requires the user to pick the number of clusters, or “k.” It then randomly chooses k centroids and assigns each sample to the cluster whose centroid is nearest according to a distance metric. Following the assignment, it recalculates each centroid as the center (mean) of the samples assigned to that cluster. It then reassigns samples to clusters based on the new centroids, repeating these two steps until the assignments stop changing. This usually takes only a few iterations.
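The assign-and-update loop described above can be sketched in plain Python. This is only an illustration under some assumptions: the function name `kmeans`, its parameters, and the toy points are made up for this sketch, it uses squared Euclidean distance, and it is not how a library such as scikit-learn implements the algorithm.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Toy K-means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Randomly pick k distinct samples as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        # Converged: no assignment changed since the previous pass.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical toy data: two loose groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, the cluster numbering (and occasionally the grouping itself) can differ between seeds, which is one reason library implementations typically run several initializations and keep the best one.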

Because clustering uses distance metrics to determine which samples are similar, the behavior may change depending on the scale of the data. You can standardize the data to put all of the features on the same scale. Some have suggested that an SME (subject matter expert) might advise against standardizing if the scale hints that some features have more importance. We will standardize the data in this example.
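To see why scale matters, here is a small sketch that standardizes each feature to zero mean and unit variance and compares distances before and after. The rows are made-up (age, fare) pairs, not taken from the Titanic data, and the helper names `standardize` and `euclidean` are hypothetical.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [pstdev(c) for c in cols]  # a constant column would give 0 here
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical (age, fare) rows; fare spans a much wider range than age.
rows = [(22.0, 7.25), (38.0, 71.28), (26.0, 7.92), (35.0, 53.10)]
scaled = standardize(rows)

# Before scaling, the distance is dominated by the fare column;
# after scaling, both features contribute on comparable terms.
print(euclidean(rows[0], rows[1]))
print(euclidean(scaled[0], scaled[1]))
```

In practice you would more likely reach for a library transformer such as scikit-learn's `StandardScaler`, which applies the same kind of per-feature rescaling.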

In this example, we will cluster the Titanic passengers. We will start with two clusters to see if the clustering can tease apart survival (we won’t leak the survival data into the clustering and will only use X, not y).
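A minimal sketch of this workflow with scikit-learn might look like the following. The Titanic feature matrix is not reproduced here, so `X` below is a synthetic stand-in generated with NumPy (an assumption made purely for illustration). With the real data you would pass your prepared feature matrix instead and leave `y` out of the fit.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix X (NOT the Titanic data):
# two groups of 50 samples with two numeric features each.
rng = np.random.default_rng(42)
X = np.vstack(
    [
        rng.normal(loc=[25.0, 10.0], scale=[5.0, 3.0], size=(50, 2)),
        rng.normal(loc=[45.0, 80.0], scale=[5.0, 10.0], size=(50, 2)),
    ]
)

# Standardize so that no single feature dominates the distance metric.
X_std = StandardScaler().fit_transform(X)

# Fit K-means with two clusters; only X is used, never the labels y.
km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(X_std)

print(np.bincount(labels))   # number of samples in each cluster
print(km.cluster_centers_)   # centroids in standardized feature space
```

The cluster numbers are arbitrary, so when you later compare them against survival, cluster 0 may correspond to either outcome.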

Unsupervised algorithms ...
