Chapter 4
Favorite solutions and cluster validation
The criteria and algorithms in the previous chapters depend on computational parameters that were essentially arbitrary but fixed. For any number of clusters, g, and of discards, n−r, we have obtained a finite and moderate number of locally optimal (mixture model) and steady (classification model) solutions. Although this number may still be fairly large, this means a substantial reduction from the initially astronomical number of all possible
[Figure: scatter plot with cluster labels A, B (B1, B2), C, D; data points shown as +.]
Figure 4.1 The sample data set of four clusters and six outliers used for illustration.
partitions to a much smaller set. If there is a true solution, we hope that it is still contained there. The final task is to further reduce this number to a handful of acceptable solutions or, in the best case, even to a single one.
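The reduction to a moderate set of locally optimal solutions can be made concrete with a small experiment: run the EM algorithm from many starting points and collect the distinct local maxima of the log-likelihood it reaches. The following stdlib-only Python sketch (all names and the data are illustrative, not from the text) does this for a two-component normal mixture in one dimension; real applications would of course use the multivariate, trimmed versions discussed in the previous chapters.

```python
import math
import random

def em_1d(data, starts, n_iter=200):
    """Run EM for a two-component 1-D normal mixture from several starting
    mean pairs; return the distinct local maxima of the log-likelihood
    reached (rounded so that equal optima are deduplicated)."""
    n = len(data)

    def loglik(pi, mu1, s1, mu2, s2):
        c = math.sqrt(2 * math.pi)
        total = 0.0
        for x in data:
            p1 = pi * math.exp(-0.5 * ((x - mu1) / s1) ** 2) / (s1 * c)
            p2 = (1 - pi) * math.exp(-0.5 * ((x - mu2) / s2) ** 2) / (s2 * c)
            total += math.log(max(p1 + p2, 1e-300))  # guard against underflow
        return total

    optima = set()
    for mu1, mu2 in starts:
        pi, s1, s2 = 0.5, 1.0, 1.0
        for _ in range(n_iter):
            # E-step: responsibility of component 1 for each observation
            r = []
            for x in data:
                p1 = pi * math.exp(-0.5 * ((x - mu1) / s1) ** 2) / s1
                p2 = (1 - pi) * math.exp(-0.5 * ((x - mu2) / s2) ** 2) / s2
                r.append(p1 / (p1 + p2) if p1 + p2 > 0 else 0.5)
            # M-step: update weight, means, and standard deviations
            n1 = max(sum(r), 1e-12)
            n2 = max(n - n1, 1e-12)
            pi = n1 / n
            mu1 = sum(ri * x for ri, x in zip(r, data)) / n1
            mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / n2
            s1 = math.sqrt(sum(ri * (x - mu1) ** 2
                               for ri, x in zip(r, data)) / n1) or 1e-6
            s2 = math.sqrt(sum((1 - ri) * (x - mu2) ** 2
                               for ri, x in zip(r, data)) / n2) or 1e-6
        optima.add(round(loglik(pi, mu1, s1, mu2, s2), 2))
    return sorted(optima)

# Two well-separated groups plus one stray observation: many different
# starting points collapse onto only a handful of local optima.
random.seed(0)
data = ([random.gauss(0, 1) for _ in range(60)]
        + [random.gauss(8, 1) for _ in range(60)] + [20.0])
starts = [(-1, 1), (0, 8), (8, 20), (0, 20), (4, 4.1)]
print(em_1d(data, starts))
```

Each start converges to some local maximum, and rounding and collecting the final log-likelihoods in a set mimics the deduplication of solutions described above: the output list is far shorter than the number of conceivable partitions.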
This is not easy. One reason is that complex data does not know its true structure, even if there is one. Uniqueness of the solution belongs to classical inferential thinking. This is not appropriate here. Even if the number of components is known, the solution may yet be ambiguous. In fact, one of the more complex problems of cluster analysis is overcoming the nonuniqueness of the solution given the number of clusters. For instance, the data set shown in Fig. 4.1 without the six outliers has been sampled from four normal components. However, there are small gaps in the lower part of cluster A and about the middle of cluster B. They could give rise to two different five-component solutions. Selecting solutions, estimating the number of clusters, as well as that of outliers, and cluster validation are therefore sometimes considered the realm of exploratory and heuristic methods such as visualization. Nevertheless, there are some statistical tools, such as tests, that are helpful.
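The ambiguity caused by such gaps is easy to reproduce: give each of two groups a small internal gap, ask for one cluster more than the truth, and different starting points split different groups while attaining the same value of the objective function. The stdlib-only sketch below uses Lloyd's k-means algorithm in one dimension purely as a simple stand-in for the classification-model criterion; the function name and data are illustrative, not from the text.

```python
def kmeans_1d(data, centers, n_iter=50):
    """Lloyd's algorithm in one dimension; returns the within-cluster sum
    of squares and the sorted final centers (rounded for comparison)."""
    centers = list(centers)
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest center
        groups = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
            groups[j].append(x)
        # update step: each center moves to the mean of its group
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    wss = sum(min((x - c) ** 2 for c in centers) for x in data)
    return round(wss, 2), sorted(round(c, 2) for c in centers)

# Two groups of equal size, each with a small internal gap (as in
# clusters A and B of Fig. 4.1). With three clusters requested, different
# starts split either group, giving two structurally different partitions
# of identical quality.
left  = [0.0, 0.4, 0.8, 1.6, 2.0, 2.4]    # gap between 0.8 and 1.6
right = [8.0, 8.4, 8.8, 9.6, 10.0, 10.4]  # gap between 8.8 and 9.6
data = left + right

sol_a = kmeans_1d(data, [0.4, 2.0, 9.0])   # splits the left group
sol_b = kmeans_1d(data, [1.0, 8.4, 10.0])  # splits the right group
print(sol_a)  # → (5.12, [0.4, 2.0, 9.2])
print(sol_b)  # → (5.12, [1.2, 8.4, 10.0])
```

Both solutions reach the same within-cluster sum of squares, 5.12, with entirely different partitions: the objective function alone cannot decide between them, which is exactly the nonuniqueness discussed above.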
Besides undesired solutions, there may even be multiple sensible ones. In fact, the nonuniqueness of the solution in Fig. 4.1 is not by coincidence. A data set may allow more than just one interesting solution, a fact that has been well known to practitioners. Gondek [207] writes on p. 245, "Data contains many plausible clusterings," and on p. 249, "It is often the case that there exist multiple clusterings which are of high quality, i.e., obtain high values of the objective function. These may consist of minor variations on a single clustering or may include clusterings which are substantially dissimilar." Jain et al. [268] express in the same spirit, "Clustering is a subjective process; the same set of data items often needs to be partitioned differently for different applications." These quotations support the fact that, except in simple situations, the clustering problem is ambiguous. It is eventually the investigator who must decide among the sensible solutions. Attempting to design systems that always produce a single solution in one sweep is, therefore, not advisable. Too often the outputs of computer programs are taken for granted.
This chapter deals with estimating the numbers of groups and of outliers, and with identifying valid solutions.