Chapter 10. Unsupervised Learning: Clustering and Dimensionality Reduction

This chapter is about techniques for studying the latent structure of your data in situations where you don't know a priori what it should look like. These techniques are often called “unsupervised” learning because, unlike classification and regression, the “right answers” are not known going in. There are two primary ways of studying a dataset's structure: clustering and dimensionality reduction.

Clustering is an attempt to group the data points into distinct “clusters.” Typically, this is done in the hopes that the different clusters correspond to different underlying phenomena. For example, if you plotted people's height on the x-axis and their weight on the y-axis, you would see two more-or-less clear blobs, corresponding to men and women. An alien who knew nothing else about human biology might hypothesize that we come in two distinct types.
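To make the height-and-weight picture concrete, here is a minimal sketch of one common clustering method, k-means (Lloyd's algorithm), written with only the Python standard library. The data is synthetic: the blob centers, spreads, and sample sizes are made-up numbers chosen purely for illustration, and the helper names `kmeans` and `make_blob` are hypothetical. Nothing in the text above says this is the method the alien would have to use; it is just one simple way to recover two blobs.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
    return centroids, labels


def make_blob(rng, n, mean_height, mean_weight):
    """Synthetic (height in cm, weight in kg) pairs; all numbers are illustrative."""
    return [(rng.gauss(mean_height, 5.0), rng.gauss(mean_weight, 7.0)) for _ in range(n)]


rng = random.Random(42)
people = make_blob(rng, 200, 177.0, 82.0) + make_blob(rng, 200, 163.0, 62.0)

centroids, labels = kmeans(people, k=2)
for height, weight in sorted(centroids):
    print(f"centroid: height={height:.1f} cm, weight={weight:.1f} kg")
```

Note that the algorithm is never told which blob a point came from; it only sees the coordinates. Because height and weight are on different scales, a real analysis would usually standardize the columns first, which this sketch skips for brevity.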

In dimensionality reduction, the goal isn't to look for distinct categories in the data. Instead, the idea is that the different fields are largely redundant, and we want to extract the real, underlying variability in the data. More precisely, your data is d-dimensional, but all of the points actually lie on a k-dimensional subset of the space (with k < d), plus some d-dimensional noise. For example, in 3D data, your points could lie mostly along a single line, or perhaps along a curve such as a circle. Real situations, of course, are usually not so clean-cut. It's more useful ...
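As a rough illustration of the 3D-points-along-a-line case (d = 3, k = 1), the sketch below generates synthetic points scattered around a line and then recovers the line's direction from the data alone. It finds the direction of greatest variance by power iteration on the covariance matrix, which is the idea behind the first component in principal component analysis; that choice of technique, the line direction, the noise level, and the helper names `covariance_matrix` and `top_component` are all assumptions made for this example.

```python
import math
import random


def covariance_matrix(points):
    """Sample covariance matrix of a list of equal-length coordinate lists."""
    n, d = len(points), len(points[0])
    means = [sum(p[i] for p in points) / n for i in range(d)]
    centered = [[p[i] - means[i] for i in range(d)] for p in points]
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(cov, iterations=100):
    """Power iteration: (largest eigenvalue, matching unit eigenvector)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v


# Synthetic data: points along one line in 3D, plus small noise in every coordinate.
rng = random.Random(0)
raw = (2.0, -1.0, 0.5)  # illustrative line direction, not taken from any real dataset
length = math.sqrt(sum(x * x for x in raw))
direction = [x / length for x in raw]

points = []
for _ in range(500):
    t = rng.gauss(0.0, 3.0)  # position along the line
    points.append([t * u + rng.gauss(0.0, 0.1) for u in direction])

cov = covariance_matrix(points)
eigenvalue, component = top_component(cov)
total_variance = sum(cov[i][i] for i in range(3))
explained = eigenvalue / total_variance
alignment = abs(sum(a * b for a, b in zip(component, direction)))

print(f"fraction of variance along the top direction: {explained:.3f}")
print(f"|cosine| between recovered and true direction: {alignment:.3f}")
```

When nearly all of the variance falls along one recovered direction, a single coordinate (the position along that direction) describes each point almost as well as the original three. A straight-line method like this would not capture the circle case, where the low-dimensional structure is curved.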

