Chapter 19. Clustering
19.0 Introduction
In much of this book we have looked at supervised machine learning—where we have access to both the features and the target. This is, unfortunately, not always the case. Frequently, we run into situations where we only know the features. For example, imagine we have records of sales from a grocery store and we want to break up sales by whether the shopper is a member of a discount club. This would be impossible using supervised learning because we don’t have a target on which to train and evaluate our models. However, there is another option: unsupervised learning. If the behavior of discount club members and nonmembers in the grocery store is actually disparate, then the average difference in behavior between two members will be smaller than the average difference in behavior between a member and a nonmember. Put another way, there will be two clusters of observations.
The goal of clustering algorithms is to identify those latent groupings of observations, which, if done well, allows us to predict the class of observations even without a target vector. There are many clustering algorithms, and they have a wide variety of approaches to identifying the clusters in data. In this chapter, we will cover a selection of clustering algorithms using scikit-learn and how to use them in practice.
19.1 Clustering Using K-Means
Problem
You want to group observations into k groups.
Solution
Use k-means clustering:
# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
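The code listing is truncated at this point in this copy. What follows is a minimal sketch of how the solution might continue, assuming scikit-learn's Iris dataset and k = 3 purely for illustration; these specifics are not confirmed by the truncated original:

# Illustrative continuation (assumed example data and parameters)
# Load data
iris = datasets.load_iris()
features = iris.data

# Standardize features
scaler = StandardScaler()
features_std = scaler.fit_transform(features)

# Create k-means object with three clusters
cluster = KMeans(n_clusters=3, random_state=0, n_init=10)

# Fit model
model = cluster.fit(features_std)

# View the predicted cluster membership of each observation
model.labels_

Standardizing the features first matters because k-means groups observations by Euclidean distance, so features measured on larger scales would otherwise dominate the cluster assignments.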