4 Cluster Analysis Part 1: Using K-Means to Segment Your Customer Base

Cluster analysis is the practice of gathering up a bunch of objects and separating them into groups of similar objects. By exploring these different groups—determining how they're similar and how they're different—you can learn a lot about the previously amorphous pile of data. And that insight can help you make better decisions at a level that's more detailed than before.

Clustering is often called exploratory data mining because these techniques help tease out relationships in large datasets that are too hard to spot by eye. Revealing relationships in your population is useful across industries, whether it's for recommending films based on the habits of folks in a taste cluster, identifying crime hot spots within urban areas, or grouping financial investments with related returns so that a diversified portfolio spans clusters.

One of my favorite uses for clustering is image clustering: lumping together image files that “look the same” to the computer. For example, many smartphones can cluster similar images thematically and let users navigate between these clusters without going through the entire set of images.

This chapter looks at the most common type of clustering, called K-means clustering, which originated in the 1950s and has since become a go-to clustering technique across industry and government. It's easy to implement and explain. As well, the ...
