5 General Clustering Techniques

5.1 Brief Overview of Clustering

As an exploratory tool, clustering techniques are frequently used as a means of identifying features in a data set, such as sets or subsets of observations with certain common (known or unknown) characteristics. The aim is to group observations into clusters (also called classes, groups, categories, etc.) that are internally as homogeneous as possible and as heterogeneous as possible across clusters. In this sense, the methods can be powerful tools in an initial data analysis, as a precursor to more detailed in-depth analyses.
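One way to make "internally homogeneous, heterogeneous across clusters" concrete is to split the total variation of a data set into a within-cluster part and a between-cluster part. The sketch below does this for classical point data with a sum-of-squares criterion. That criterion, the helper names (`centroid`, `sq_dist`, `within_between_ss`), and the toy points are assumptions chosen for illustration; many other homogeneity criteria exist.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def within_between_ss(points, labels):
    """Return (within-cluster SS, between-cluster SS) for a given partition."""
    grand = centroid(points)
    within = 0.0
    between = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        c = centroid(members)
        # Spread of the members around their own cluster centroid.
        within += sum(sq_dist(p, c) for p in members)
        # Spread of the cluster centroid around the overall centroid.
        between += len(members) * sq_dist(c, grand)
    return within, between


# Hypothetical toy data: two visibly separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]

good = within_between_ss(points, [0, 0, 1, 1])  # (1.0, 50.0)
poor = within_between_ss(points, [0, 1, 0, 1])  # (50.0, 1.0)
```

Both partitions have the same total (51.0), but the first puts almost all of it between clusters, which is what a homogeneous-within, heterogeneous-across grouping looks like under this particular criterion.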

However, the plethora of available methods can be both a strength and a weakness, since methods vary in which specific aspects are important to that methodology, and since not all methods give the same answers on the same data set. Even then, a given method can produce varying results depending on the underlying metrics used in its application. For example, if a clustering algorithm is based on distance/dissimilarity measures, we have seen from Chapters 3 and that different measures can have different values for the same data set. While not always so, often the distances used are Euclidean distances (see Eq. (3.1.8)) or other Minkowski distances such as the city block distance. Just as no one distance measure is universally preferred, so it is that no one clustering method is generally preferred. Hence, all give differing results; all have their strengths and weaknesses, depending ...
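The point that different distance measures can give different values, and hence different groupings, on the same data can be seen in a small sketch. It uses the classical Minkowski distance for point-valued observations, with order 1 giving the city block distance and order 2 the Euclidean distance. This is an assumption made for illustration and is not necessarily the form of Eq. (3.1.8); the function names and the three points are hypothetical.

```python
def minkowski(u, v, p):
    """Minkowski distance of order p between two points of equal length."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def nearest(query, candidates, p):
    """Name of the candidate closest to `query` under the order-p distance."""
    return min(candidates, key=lambda name: minkowski(query, candidates[name], p))


# Hypothetical toy observations.
query = (0.0, 0.0)
candidates = {"a": (3.0, 3.0), "b": (0.0, 5.0)}

# Euclidean (p = 2): d(query, a) is about 4.24, d(query, b) = 5.0
nearest_euclidean = nearest(query, candidates, p=2)   # "a"

# City block (p = 1): d(query, a) = 6.0, d(query, b) = 5.0
nearest_city_block = nearest(query, candidates, p=1)  # "b"
```

Here the nearest neighbor of the query point changes with the metric, so any algorithm that merges or assigns observations by proximity could build different clusters from identical data.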

