CHAPTER 15 Cluster Analysis

This chapter is about the popular unsupervised learning task of clustering, where the goal is to segment the data into a set of homogeneous clusters of records for the purpose of generating insight. Separating a dataset into clusters of homogeneous records is also useful for improving the performance of supervised methods, by modeling each cluster separately rather than the entire, heterogeneous dataset. Clustering is used in a wide variety of business applications, from customized marketing to industry analysis. We describe two popular clustering approaches: hierarchical clustering and k-means clustering. In hierarchical clustering, records are sequentially grouped to create clusters, based on distances between records and distances between clusters. We describe how the algorithm works in terms of the clustering process and mention several common distance metrics. Hierarchical clustering also produces a useful graphical display of the clustering process and results, called a dendrogram. We present dendrograms and illustrate their usefulness. k-means clustering is widely used in applications with large datasets. In k-means clustering, each record is allocated to one of a prespecified number of clusters, according to its distance from each cluster. We describe the k-means clustering algorithm and its computational advantages. Finally, we present techniques that assist in generating insight from clustering results.
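To make the two approaches concrete before the details, here is a minimal sketch using only the Python standard library. It is not the chapter's own code. The function names `kmeans` and `agglomerate`, the toy `records`, the choice of Euclidean distance, the use of single linkage (minimum distance) between clusters, and the random-record initialization of the centroids are all assumptions made for illustration.

```python
import math
import random


def kmeans(records, k, n_iter=100, seed=0):
    """Illustrative k-means sketch: returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct records chosen at random.
    centroids = [list(r) for r in rng.sample(records, k)]
    labels = [None] * len(records)
    for _ in range(n_iter):
        # Assignment step: each record joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(r, centroids[j]))
            for r in records
        ]
        if new_labels == labels:  # no record changed cluster: stop
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


def agglomerate(records, n_clusters):
    """Illustrative agglomerative sketch with single linkage.

    Returns (clusters, merges); clusters hold record indices and
    merges lists (cluster_a, cluster_b, distance) in merge order,
    which is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(records))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        # Find the closest pair of clusters (minimum pairwise distance).
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    math.dist(records[i], records[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters, merges


# Hypothetical toy data: two well-separated groups of three records.
records = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
           (8.0, 8.2), (8.1, 7.9), (7.8, 8.0)]
labels, centroids = kmeans(records, k=2)
clusters, merges = agglomerate(records, n_clusters=2)
```

On data this well separated, both sketches should recover the same two groups. The hierarchical version compares every pair of clusters at every merge, while k-means only compares each record with k centroids per pass, which hints at why k-means scales more easily to large datasets.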

Python

In this chapter, we will use pandas ...
