Chapter 4

Favorite solutions and cluster validation

The criteria and algorithms in the previous chapters depend on computational parameters that were essentially arbitrary but fixed. For any number of clusters, g, and of discards, n − r, we have obtained a finite and moderate number of locally optimal (mixture model) and steady (classification model) solutions. Although this number may still be fairly large, this means a substantial reduction from the initially astronomical number of all possible

[Scatter plot: data points marked “+”, with region labels A, B1, B2, C, and D.]

Figure 4.1 The sample data set of four clusters and six outliers used for illustration.

partitions to a much smaller set. If there is a true solution, we hope that it is still contained there. The final task is to further reduce this number to a handful of acceptable solutions or, in the best case, even to a single one.

This is not easy. One reason is that complex data does not know its true structure, even if there is one. Uniqueness of the solution belongs to classical inferential thinking. This is not appropriate here. Even if the number of components is known, the solution may yet be ambiguous. In fact, one of the more complex problems of cluster analysis is overcoming the nonuniqueness of the solution given the number of clusters. For instance, the data set shown in Fig. 4.1 without the six outliers has been sampled from four normal components. However, there are small gaps in the lower part of cluster A and about the middle of cluster B. They could give rise to two different five-component solutions. Selecting solutions, estimating the number of clusters, as well as that of outliers, and cluster validation are therefore sometimes considered the realm of exploratory and heuristic methods such as visualization. Nevertheless, there are some statistical tools, such as tests, that are helpful.
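One such statistical tool is an information criterion applied to a range of candidate component numbers. The following sketch, which is not taken from the book, illustrates the idea on simulated data in the spirit of Fig. 4.1: four well-separated normal components are fitted with Gaussian mixtures for several values of g, and BIC is used to shortlist candidates. The use of scikit-learn's GaussianMixture is my assumption here, a convenient stand-in for the book's own mixture-model estimators.

```python
# Sketch (assumed tooling: scikit-learn). Four simulated normal
# components are scored by BIC for g = 2..6 to shortlist candidate
# numbers of clusters.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
means = np.array([[0, 0], [8, 0], [0, 8], [8, 8]], dtype=float)
# 100 points per component, unit covariance, well separated
X = np.vstack([rng.normal(m, 1.0, size=(100, 2)) for m in means])

bics = {}
for g in range(2, 7):
    gm = GaussianMixture(n_components=g, n_init=5, random_state=0).fit(X)
    bics[g] = gm.bic(X)  # lower BIC = better trade-off of fit vs. size

best_g = min(bics, key=bics.get)
print(best_g)
```

On data this clearly separated, BIC recovers the generating number of components; on data with gaps inside clusters, as in Fig. 4.1, several values of g may score almost equally well, which is exactly the ambiguity discussed above.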

Besides undesired solutions, there may even be multiple sensible ones. In fact, the nonuniqueness of the solution in Fig. 4.1 is not by coincidence. A data set may allow more than just one interesting solution, a fact that has been well known to practitioners. Gondek [207] writes on p. 245, “Data contains many plausible clusterings,” and on p. 249, “It is often the case that there exist multiple clusterings which are of high quality, i.e., obtain high values of the objective function. These may consist of minor variations on a single clustering or may include clusterings which are substantially dissimilar.” Jain et al. [268] express the same spirit: “Clustering is a subjective process; the same set of data items often needs to be partitioned differently for different applications.” These quotations support the fact that, except in simple situations, the clustering problem is ambiguous. It is eventually the investigator who must decide among the sensible solutions. Attempting to design systems that always produce a single solution in one sweep is, therefore, not advisable. Too often the outputs of computer programs are taken for granted.
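Whether two high-quality clusterings are “minor variations on a single clustering” or “substantially dissimilar” can itself be quantified. A common instrument for this is the adjusted Rand index (ARI), which scores the agreement of two partitions independently of how their clusters are labeled; the sketch below, with toy labelings and scikit-learn's implementation, is my illustration and not part of the book's methodology.

```python
# Sketch (assumed tooling: scikit-learn). The adjusted Rand index
# compares two partitions of the same data, ignoring label names.
from sklearn.metrics import adjusted_rand_score

a = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # one candidate partition
b = [2, 2, 2, 0, 0, 0, 1, 1, 1]  # the same partition, labels renamed
c = [0, 0, 0, 1, 1, 1, 2, 2, 1]  # a minor variation: one point moved
d = [0, 1, 2, 0, 1, 2, 0, 1, 2]  # a substantially dissimilar partition

print(adjusted_rand_score(a, b))  # 1.0: identical up to relabeling
print(adjusted_rand_score(a, c))  # high but below 1: minor variation
print(adjusted_rand_score(a, d))  # negative: less agreement than chance
```

An ARI near 1 between two local optima suggests they describe essentially the same clustering; a value near or below 0 flags genuinely different solutions among which the investigator must decide.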

This chapter deals with estimating the numbers of groups and outliers, and with validating solutions.

