Chapter 14

Clustering model evaluation

14.1 Introduction

Reliable model evaluation, discussed for classification and regression models in Chapters 7 and 10, respectively, is just as important for clustering models. Classification and regression have certain natural quality criteria, but for clustering it is much less clear how to assess model quality objectively. As a result, many more performance measures have been proposed and used, and their outcomes tend to be treated with considerable reserve.

Although this is less widely recognized than for classification and regression, generalization also matters when evaluating clustering models. The value of any performance measure on a particular dataset (dataset performance) is therefore a possibly imperfect estimate of the corresponding value on the whole domain (true performance).
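The gap between dataset performance and true performance can be illustrated with a minimal sketch. It is written in Python for illustration only and is not code from the book. Everything in it is an assumption made for the example: the two-blob domain generated by `sample`, the tiny `kmeans` routine, and the choice of mean squared distance to the nearest center as the quality measure. A large independent sample stands in for the whole domain, so the second value only approximates true performance.

```python
import random

def nearest(centers, x):
    """Index of the center closest to x (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(centers[j], x)))

def kmeans(data, k, iters=20, seed=0):
    """Tiny k-means for illustration: returns k centers fitted to data."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in data:
            groups[nearest(centers, x)].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def mean_sq_dist(centers, data):
    """Quality measure: mean squared distance to the nearest center (lower is better)."""
    total = 0.0
    for x in data:
        c = centers[nearest(centers, x)]
        total += sum((a - b) ** 2 for a, b in zip(c, x))
    return total / len(data)

def sample(n, rng):
    """Hypothetical domain: two Gaussian blobs in the plane."""
    return [(rng.gauss(m, 1.0), rng.gauss(m, 1.0))
            for m in (rng.choice((0.0, 5.0)) for _ in range(n))]

rng = random.Random(42)
train = sample(30, rng)      # the dataset used to build the model
fresh = sample(5000, rng)    # large independent sample standing in for the domain

centers = kmeans(train, k=2)
dataset_perf = mean_sq_dist(centers, train)      # dataset performance
approx_true_perf = mean_sq_dist(centers, fresh)  # approximation of true performance
print(dataset_perf, approx_true_perf)
```

The two printed values generally differ. With a small dataset, the value computed on the data used to build the model can be a noticeably noisy estimate of the value on the domain.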

Clustering quality measures may, but need not, explicitly use the instance dissimilarity or similarity measures presented in Chapter 11. Those that do are often applied to models created by dissimilarity-based clustering algorithms, in which case it usually makes most sense to use the same dissimilarity measure for model creation and evaluation.
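A small sketch can show why the dissimilarity measure is best kept consistent. It is written in Python for illustration only and is not code from the book. The toy data, the fixed medoids, and the helper names `assign` and `mean_within_diss` are all hypothetical. The quality measure takes the dissimilarity function as a parameter, so the function that produced the cluster assignments can be passed on to the evaluation.

```python
from itertools import combinations

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def assign(medoids, data, diss):
    """Assign each instance to the medoid with the smallest dissimilarity."""
    return [min(range(len(medoids)), key=lambda j: diss(medoids[j], x))
            for x in data]

def mean_within_diss(data, labels, diss):
    """Mean pairwise dissimilarity between instances that share a cluster."""
    pairs = [(x, y)
             for (x, lx), (y, ly) in combinations(zip(data, labels), 2)
             if lx == ly]
    if not pairs:  # every cluster is a singleton
        return 0.0
    return sum(diss(x, y) for x, y in pairs) / len(pairs)

# Hypothetical toy data and medoids, chosen so the two measures disagree on (3, 3).
data = [(0, 0), (1, 0), (3, 3), (3, 8), (4, 8)]
medoids = [(0, 0), (3, 8)]

labels_euc = assign(medoids, data, euclidean)  # [0, 0, 0, 1, 1]
labels_man = assign(medoids, data, manhattan)  # [0, 0, 1, 1, 1]

# Evaluate each clustering with the dissimilarity that produced it.
quality_euc = mean_within_diss(data, labels_euc, euclidean)
quality_man = mean_within_diss(data, labels_man, manhattan)
print(labels_euc, quality_euc)
print(labels_man, quality_man)
```

Here the instance (3, 3) is closer to the first medoid under Euclidean distance but closer to the second under Manhattan distance. Scoring a clustering with a measure other than the one that shaped it would therefore judge it against a notion of closeness it was never built to respect.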

14.1.1 Dataset performance

The dataset performance of a clustering model is assessed directly by calculating one or more selected performance ...

Data Mining Algorithms: Explained Using R by Pawel Cichosz