Appendix A. Supplemental Material for Chapter 2

In our discussion of clustering, we primarily used the standard Euclidean distance between vectors in a vector space:

$d\left(x,y\right)=\sqrt{\sum_{i}{\left({x}_{i}-{y}_{i}\right)}^{2}}$

Euclidean distance is also known as the L2 norm. There are several other metrics that are commonly used in applications:

• One variation of Euclidean distance is the L1 norm, also known as Manhattan distance (because it counts the number of “blocks” between two points on a grid):

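$d\left(x,y\right)=\sum_{i}\left|{x}_{i}-{y}_{i}\right|$

• Another is the L∞ norm (also known as Chebyshev distance), which takes the largest difference along any single coordinate. It is defined as the following: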

$d\left(x,y\right)=\max_{i}\left|{x}_{i}-{y}_{i}\right|$

• For vectors of binary values or bits, you can use Hamming distance, which is the number of bit positions at which x and y differ. This can be computed as:

$d\left(x,y\right)=H\left(x\oplus y\right)$

where H(v) is the Hamming weight; that is, the number of “1” bits in v. If the points you compare are of different bit lengths, the shorter one will need to be padded with leading zeros.

• For lists (treated as sets of elements), you can use the Jaccard similarity:

$J\left(x,y\right)=\frac{|x\cap y|}{|x\cup y|}$

The Jaccard similarity computes the number of elements in common between x and y, normalized by the total number of elements in the union. Because it is a similarity rather than a distance (identical sets score 1), the corresponding distance is 1 − J(x, y). One useful property of Jaccard similarity is that you can use it to compare lists of different lengths.
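
The measures listed above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation, and the function names are hypothetical. It assumes that vectors are equal-length sequences of numbers, that bit vectors are stored as non-negative integers (so leading zeros are implicit), and that Hamming distance is taken in its standard sense, the number of differing bit positions. The value returned for two empty lists in `jaccard` is a convention chosen here.

```python
import math


def euclidean(x, y):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    """Number of differing bit positions between two non-negative ints."""
    # XOR leaves a 1 exactly where the bits differ; count those 1s.
    return bin(x ^ y).count("1")


def jaccard(x, y):
    """Jaccard similarity of two lists, treated as sets."""
    a, b = set(x), set(y)
    if not a and not b:
        return 1.0  # assumption: two empty inputs count as identical
    return len(a & b) / len(a | b)


p, q = [0, 0], [3, 4]
print(euclidean(p, q))           # 5.0
print(manhattan(p, q))           # 7
print(chebyshev(p, q))           # 4
print(hamming(0b1011, 0b0110))   # 3
print(jaccard(["a", "b", "c"], ["b", "c", "d", "e"]))  # 0.4
```

On the same pair of points the three vector distances give 5.0, 7, and 4, which shows how much the choice of metric can change what "close" means to a clustering algorithm.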

The L1 and L2 metrics in vector spaces suffer from what is known as the “curse of dimensionality.” This phrase refers to the principle that as the number of dimensions increases, all points seem to be roughly equally distant from one another. Thus, ...
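
One way to see this effect is a small experiment, sketched here under the assumption that points are drawn uniformly at random from the unit hypercube (real data may behave differently). The hypothetical helper `relative_contrast` measures how much farther the farthest point is than the nearest one, relative to the nearest, under the L2 metric. A value near zero means that all points look roughly equally distant from the query.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest L2 distance from one random
    query point to n_points uniform random points in [0, 1]^dim."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, point))))
    return (max(dists) - min(dists)) / min(dists)


# The contrast should shrink as the number of dimensions grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

The exact numbers depend on the seed and the sample size, but the downward trend with increasing dimension is the point of the exercise.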
