Chapter 14

Alternative Approaches to Cluster Detection

The previous chapter introduces clustering in the business context, using the most common clustering technique, k-means. K-means clustering has much to recommend it. It is powerful and quite scalable, so it can run on very large data sets. It is available in most data mining tools. The ambitious can even implement k-means clustering in SQL.

K-means is not, by any means, the only clustering technique. Because the purpose of clustering is to find interesting patterns, having additional techniques is a benefit. Different clustering techniques give more perspectives on the islands of similarity lurking in the data. Sometimes k-means finds good, useful clusters, but not always. This chapter starts with an explanation of the shortcomings of k-means, showing an example where the clusters it identifies are simply not intuitive.

The first alternative method uses Gaussian mixture models (GMM) and is sometimes called expectation-maximization (EM) clustering. This form of clustering is quite similar to k-means. The most obvious difference is that GMM produces soft clusters rather than hard clusters. With soft clusters, a record can be associated with more than one cluster. However, there are other differences as well, and GMM clusters can be more effective than k-means clusters.
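
The difference between hard and soft clusters can be illustrated with a minimal sketch. It is not taken from the book: the two one-dimensional components, their weights, means, and standard deviations, and the function names are all hypothetical, chosen only to show how a single record receives one label under a k-means-style assignment but a set of membership probabilities under a GMM-style assignment.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical components: (mixing weight, mean, standard deviation).
# In a real GMM these would be estimated from the data, not fixed by hand.
COMPONENTS = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]

def hard_assignment(x):
    """k-means style: index of the single nearest cluster center."""
    distances = [abs(x - mean) for _, mean, _ in COMPONENTS]
    return distances.index(min(distances))

def soft_assignment(x):
    """GMM style: probability that x belongs to each component."""
    weighted = [w * gaussian_pdf(x, mean, std) for w, mean, std in COMPONENTS]
    total = sum(weighted)
    return [v / total for v in weighted]

for x in (0.5, 2.0, 3.5):
    probs = ", ".join(f"{p:.3f}" for p in soft_assignment(x))
    print(f"x={x}: hard cluster {hard_assignment(x)}, soft memberships [{probs}]")
```

A record near one center gets a membership close to 1 for that component, while a record midway between the two centers is shared between them. The hard assignment has to pick one cluster even in that ambiguous case.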

The next method, divisive clustering, starts with all the data in one big cluster and then looks for ways to split the data, in a process analogous to the creation ...

