Distance functions

Even if generic definitions of clustering are normally based on the concept of similarity, it's quite easy to employ its inverse, which is represented by distance function (dissimilarity measure). The most common choice is the Euclidean distance, but before choosing it, it's necessary to consider its properties and their behaviors in high-dimensional spaces. Let's start by introducing the Minkowski distance as a generalization of the Euclidean one. If the sample is xi ∈ ℜN, it is defined as:

For p=1, we obtain the Manhattan (or city block) distance, while p=2 corresponds to the standard Euclidean distance. We want to understand ...

Get Hands-On Unsupervised Learning with Python now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.