Chapter 6

Detecting the Number of Clusters through Cluster Validation

6.1. Introduction

The general approach to identifying the number of clusters by means of cluster validation is to evaluate the quality of each k-cluster solution provided by the clustering algorithm and to select the value of k that yields the optimum partition according to the quality criterion [HAL 00]. Over the past decades, many approaches to cluster validation have been proposed in parallel with the advances in clustering techniques. Some of the most popular approaches were introduced in Chapter 1, namely the Dunn index [DUN 74, BEL 98, HAV 08], the Krzanowski and Lai test [KRZ 85], the Davies-Bouldin score [DAV 79, HAL 02b], Hubert’s γ [HAL 02a], the silhouette width [ROU 87] and, more recently, the gap statistic [ROB 01]. Many of these strategies attempt to minimize the intracluster dispersion, maximize the intercluster dispersion, or both.
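The selection loop described above can be illustrated with a minimal sketch. It is not the procedure used in this book: it assumes a plain k-means (Lloyd iterations with a deterministic farthest-point seeding) as the clustering algorithm, the Euclidean distance, and the mean silhouette width as the quality criterion. The function names (farthest_point_init, kmeans, mean_silhouette, select_k) and the toy dataset are hypothetical and serve only as illustration.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at
    points[0], then repeatedly add the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette width; singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two non-empty clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def select_k(points, k_values):
    """Score every k-cluster solution and return the best k with all scores."""
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores


# Toy data: three compact 3x3 grids of points around well-separated centers.
offsets = [(dx, dy) for dx in (-0.5, 0.0, 0.5) for dy in (-0.5, 0.0, 0.5)]
blobs = [(0.0, 0.0), (10.0, 0.0), (5.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in blobs for dx, dy in offsets]

best_k, scores = select_k(points, range(2, 6))
print("selected k:", best_k)
for k, s in scores.items():
    print(f"k={k}: mean silhouette = {s:.3f}")
```

Any other validity index could be substituted for mean_silhouette, provided the selection step is adapted to whether the index is to be maximized or minimized.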

Unfortunately, the performance of validation techniques usually depends on the dataset or on the clustering algorithm used to partition the data. In addition, the distance metric applied before clustering has proven to be a relevant factor for the final cluster solution, and it may also influence how successfully a validity index determines the optimum number of clusters. In a few cases, prior assumptions about the dataset can be made, which enables the choice of the best-fitting clustering technique and distance model. However, unsupervised models are ...
