Hierarchical-clustering techniques produce a sequence of partitions in which each partition is nested within the next partition in the sequence, creating a hierarchy of clusters in ascending or descending order. Hierarchical techniques fall into two types: agglomerative and divisive.
The agglomerative technique starts with as many clusters as there are records, each cluster containing just one record. Pairs of clusters are then merged sequentially until the number of clusters is reduced to k; at each stage, the two clusters merged are the ones nearest to each other. If the merging is continued, it terminates in a hierarchy of clusters with a single cluster, containing all the records, at the top. Divisive clustering takes the opposite approach: it starts with all the records in one cluster and then repeatedly splits clusters into smaller ones.
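As an illustration of the agglomerative idea (this sketch is not from the text; the sample points, the Euclidean distance, and the single-linkage merging rule are assumptions made for the example), the following Python fragment starts with one cluster per record and repeatedly merges the nearest pair of clusters until k clusters remain:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(records, k):
    """Single-linkage agglomerative clustering: start with one cluster per
    record and repeatedly merge the two nearest clusters until k remain."""
    clusters = [[r] for r in records]            # one cluster per record
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-linkage distance
        best = (None, None, float("inf"))
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(a, b)
                        for a in clusters[i] for b in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i].extend(clusters[j])           # merge the nearest pair
        del clusters[j]
    return clusters

# Example with hypothetical data: two visible groups in the plane
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9)]
print(agglomerative(points, k=2))
```

Stopping the merging at k = 1 instead of k = 2 would yield the single all-inclusive cluster at the top of the hierarchy described above.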
Clustering can be carried out on both numerical and categorical data. For numerical data, the inherent geometric properties can be used to define distances between points, whereas for categorical data no such criterion exists. Many datasets consist of categorical attributes on which distance functions are not defined.
5.2 Statistical Methods
Thus far we have seen that clustering refers to the task of grouping objects into classes of similar objects. Cluster analysis has a long history in both statistics and machine learning. Clusters of related objects are one kind of information revealed by data mining and can lead to new insights. There are many partitioning methods that construct partitions of
can lead to new insights. There are many partitioning methods that construct partitions of
a database of N objects into a set of k clusters. The construction involves determining the
optimal partition with respect to an objective function. There are approximately k^N/k! ways of partitioning a set of N data points into k subsets. The partition-clustering algorithm normally adopts the iterative optimization paradigm: it begins with an initial partition and uses an iterative control strategy, trying to swap data points between clusters to see whether such a swap improves the quality of the clustering. A locally optimal partition is reached when no swap yields any further improvement. Furthermore, the quality of the clustering is very sensitive to the initially selected partition.
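A minimal sketch of this iterative paradigm (my own illustration, not the book's algorithm; the sum-of-squared-distances objective, the crude initial partition, and the sample data are assumptions) starts from an arbitrary initial partition and relocates single points between clusters whenever doing so lowers the objective, stopping when no move helps:

```python
def sse(clusters):
    """Objective: total squared distance of each point to its cluster's centroid."""
    total = 0.0
    for c in clusters:
        centroid = [sum(col) / len(c) for col in zip(*c)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in c)
    return total

def iterative_relocation(points, k):
    """Local search over partitions: begin with an arbitrary initial partition
    and relocate single points between clusters while that lowers the objective.
    Assumes len(points) >= k so no initial cluster is empty."""
    clusters = [list(points[i::k]) for i in range(k)]   # crude initial partition
    best = sse(clusters)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for p in list(clusters[i]):                 # snapshot: we mutate below
                if len(clusters[i]) == 1:
                    continue                            # keep every cluster non-empty
                for j in range(k):
                    if j == i:
                        continue
                    clusters[i].remove(p)               # tentatively move p to cluster j
                    clusters[j].append(p)
                    cost = sse(clusters)
                    if cost < best:                     # keep the move if it helps
                        best = cost
                        improved = True
                        break
                    clusters[j].remove(p)               # otherwise undo the move
                    clusters[i].append(p)
    return clusters

# Example with hypothetical 2-D data: the crude initial partition mixes the groups
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9)]
print(iterative_relocation(data, k=2))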
In order to find clusters, the similarity or dissimilarity of objects needs to be measured.
This is done with metrics on feature vectors such as Euclidean (or Minkowski or Manhattan)
distance for real-valued features,
\rho(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^q \right)^{1/q},
and simple matching for categorical features,
\rho(x, y) = |\{\, x_i \mid x_i = y_i \,\}|.
Frequently, the distances are normalized, or weighted differences are used.
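As a concrete sketch of these two measures (an illustration, not code from the book; the function and variable names are arbitrary), the Minkowski distance with parameter q covers Euclidean (q = 2) and Manhattan (q = 1) as special cases, and the simple-matching measure counts the positions at which two categorical records agree:

```python
def minkowski(x, y, q=2):
    """Minkowski distance; q = 2 gives Euclidean, q = 1 gives Manhattan."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def simple_matching(x, y):
    """Number of positions at which two categorical records agree."""
    return sum(1 for a, b in zip(x, y) if a == b)

# Hypothetical records
print(minkowski((1.0, 2.0, 3.0), (2.0, 4.0, 6.0), q=2))   # Euclidean: ~3.742
print(minkowski((1.0, 2.0, 3.0), (2.0, 4.0, 6.0), q=1))   # Manhattan: 6.0
print(simple_matching(("red", "xl", "cotton"), ("red", "m", "cotton")))  # 2
```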
In the next subsection, we will study the k-means algorithm, where each cluster is represented by the cluster's center of gravity.
