Chapter 19 Hierarchical and k-Means Clustering

19.1 The Clustering Task

Clustering refers to the grouping of records, observations, or cases into classes of similar objects. A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters. Clustering differs from classification in that there is no target variable for clustering. The clustering task does not try to classify, estimate, or predict the value of a target variable. Instead, clustering algorithms seek to segment the entire data set into relatively homogeneous subgroups or clusters, where the similarity of the records within the cluster is maximized, and the similarity to records outside this cluster is minimized.
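The idea of "similar within, dissimilar across" can be made concrete with a small sketch. The Python example below is only an illustration under stated assumptions: the six two-field records, the hand-assigned cluster labels, the use of Euclidean distance as the (inverse) measure of similarity, and the helper name mean_pairwise_distances are all hypothetical and do not come from the text. It compares the average distance between records that share a cluster with the average distance between records in different clusters.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+

# Hypothetical toy records (two numeric fields each), invented for illustration.
records = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
           (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]

# Hand-assigned cluster membership; a clustering algorithm would normally find this.
labels = [0, 0, 0, 1, 1, 1]


def mean_pairwise_distances(records, labels):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes at least one same-cluster pair and one different-cluster pair.
    """
    within, between = [], []
    for i, j in combinations(range(len(records)), 2):
        d = dist(records[i], records[j])
        if labels[i] == labels[j]:
            within.append(d)   # both records are in the same cluster
        else:
            between.append(d)  # records are in different clusters
    return sum(within) / len(within), sum(between) / len(between)


within, between = mean_pairwise_distances(records, labels)
print(f"mean within-cluster distance:  {within:.2f}")
print(f"mean between-cluster distance: {between:.2f}")
```

For a grouping that matches the structure of the data, the within-cluster figure should be much smaller than the between-cluster figure. Passing a poor grouping such as [0, 1, 0, 1, 0, 1] to the same function shows the within-cluster distance growing, which is the kind of outcome a clustering algorithm tries to avoid.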

For example, the Nielsen PRIZM segments, developed by Claritas Inc., represent demographic profiles of each geographic area in the United States, in terms of distinct lifestyle types, as defined by zip code. The clusters identified for zip code 90210, Beverly Hills, California, are as follows:

Cluster # 01: Upper Crust Estates
Cluster # 03: Movers and Shakers
Cluster # 04: Young Digerati
Cluster # 07: Money and Brains
Cluster # 16: Bohemian Mix.

The description for Cluster # 01: Upper Crust is “The nation's most exclusive address, Upper Crust is the wealthiest lifestyle in America, a haven for empty-nesting couples between the ages of 45 and 64. No segment has a higher concentration of residents earning over $100,000 a year and possessing a postgraduate ...


Data Mining and Predictive Analytics, 2nd Edition by Chantal D. Larose, Daniel T. Larose
