“4137X˙CH05˙Akerkar” — 2007/9/8 — 11:01 — page 178 — #2

178 CHAPTER 5 Clustering

Hierarchical-clustering techniques produce a sequence of partitions in which each partition is nested into the next partition of the sequence, creating a hierarchy of clusters in ascending or descending order. The hierarchical techniques fall into two types: agglomerative and divisive.

The agglomerative technique starts with as many clusters as there are records, each cluster containing a single record. Pairs of clusters are merged sequentially until the number of clusters is reduced to k; at each stage, the two clusters merged are the ones nearest to each other. If the merging is continued to the end, it yields a hierarchy of clusters with a single cluster, containing all the records, at the top. Divisive clustering takes the opposite approach: it starts with all the records in one cluster and then tries to split that cluster into smaller ones.
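The agglomerative merging loop can be sketched in a few lines of Python. This is a minimal illustration, not the book's own algorithm: the function name is mine, and it assumes single linkage (the distance between two clusters is the smallest pairwise distance between their members), a choice the text leaves open.

```python
import math

def agglomerative(points, k):
    """Repeatedly merge the two nearest clusters until only k remain.

    Assumes single linkage: cluster distance = smallest pairwise
    distance between members of the two clusters."""
    clusters = [[p] for p in points]  # start: one cluster per record
    while len(clusters) > k:
        best = None  # (distance, i, j) of the nearest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the nearest pair
    return clusters

# Two well-separated groups collapse into two clusters:
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
result = agglomerative(pts, 2)
```

Continuing the loop down to k = 1 instead would produce the full hierarchy described above, with a single all-inclusive cluster at the top.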

Clustering can be carried out on both numerical and categorical data. For numerical data, the inherent geometric properties can be used to define distances between the points, whereas for categorical data no such criterion exists. Many datasets consist of categorical attributes on which distance functions are not defined.

5.2 Statistical Methods

Thus far we have seen that clustering refers to the task of grouping objects into classes of

similar objects. Clustering analysis has a long history both in statistics and machine learning.

Clusters of related objects form one kind of information, which is revealed in data mining and

can lead to new insights. There are many partitioning methods that construct partitions of

a database of N objects into a set of k clusters. The construction involves determining the

optimal partition with respect to an objective function. There are approximately k^N / k! ways of partitioning a set of N data points into k subsets. The partition-clustering algorithm normally

adopts the iterative-optimization paradigm, which begins with an initial partition and uses an iterative control strategy: it tries swapping data points between clusters to see whether such a swap improves the quality of the clustering, and it settles on a locally optimal partition once no swap yields any further improvement. Furthermore, the quality of the clustering is very sensitive to the initially selected partition.
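The iterative swap-and-improve control strategy can be sketched as follows. This is an illustrative assumption, not the book's algorithm: the function names are mine, and the objective chosen here is the within-cluster sum of squared distances to each cluster's centroid, which the text does not fix.

```python
def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    dim = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dim))

def sse(clusters):
    """Objective: total squared distance of each point to its cluster centroid."""
    total = 0.0
    for c in clusters:
        m = centroid(c)
        total += sum((a - b) ** 2 for p in c for a, b in zip(p, m))
    return total

def refine(clusters):
    """Move single points between clusters while the objective improves;
    stop at a locally optimal partition (no move helps)."""
    improved = True
    while improved:
        improved = False
        for i in range(len(clusters)):
            for p in list(clusters[i]):
                if len(clusters[i]) == 1:
                    break  # keep every cluster nonempty
                for j in range(len(clusters)):
                    if j == i:
                        continue
                    before = sse(clusters)
                    clusters[i].remove(p)
                    clusters[j].append(p)
                    if sse(clusters) < before:
                        improved = True
                        break  # keep the move, go on to the next point
                    clusters[j].remove(p)   # undo: the swap did not help
                    clusters[i].append(p)
    return clusters

# A deliberately bad initial partition is untangled by point swaps:
start = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]
result = refine(start)
```

Each accepted move strictly lowers the objective, so the loop terminates; but which local optimum it reaches depends entirely on the starting partition, which is exactly the sensitivity noted above.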

In order to ﬁnd clusters, the similarity or dissimilarity of objects needs to be measured.

This is done with metrics on feature vectors such as Euclidean (or Minkowski or Manhattan)

distance for real-valued features,

ρ(x, y) = ( Σ_{i=1}^{d} |x_i − y_i|^q )^{1/q}

and simple matching for categorical features,

ρ(x, y) = |{ i | x_i = y_i }|.

Frequently, the distances are normalized, or weighted diﬀerences are used.
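Both measures are straightforward to sketch in Python (the function names are mine; the matching measure follows the definition just given, counting the attributes on which two records agree):

```python
def minkowski(x, y, q=2):
    """Minkowski distance for real-valued features.

    q=2 gives the Euclidean distance, q=1 the Manhattan distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def matching(x, y):
    """Simple matching for categorical features: the number of
    attributes on which the two records agree."""
    return sum(a == b for a, b in zip(x, y))
```

For example, `minkowski((0, 0), (3, 4))` gives the Euclidean distance 5.0, `minkowski((0, 0), (3, 4), q=1)` the Manhattan distance 7.0, and `matching(("red", "small", "round"), ("red", "large", "round"))` counts 2 agreeing attributes.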

In the next subsection, we will study the k-means algorithm, where each cluster is represented by the cluster’s center of gravity.

Get *Building an Intelligent Web: Theory and Practice* now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.