Similarity metrics

Similarity metrics [1] are a mathematical construct that is particularly useful in natural language processing, especially in information retrieval. Let's first try to understand what a metric is. A metric is a function that defines a distance between each pair of elements of a set, such as a set of vectors. It's clear how this would be useful to us: we can judge how similar two documents are based on the distance between them. A low value returned by the distance function means that the two documents are similar, and a high value means they are quite different.
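To make the idea of a distance between documents concrete, here is a minimal sketch. It assumes each document is represented as a bag-of-words count vector and that Euclidean distance is the metric; both choices, along with the helper names bag_of_words and euclidean_distance, are illustrative assumptions and not something the text prescribes.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


def euclidean_distance(doc_a, doc_b):
    """Euclidean distance between the word-count vectors of two documents."""
    counts_a = bag_of_words(doc_a)
    counts_b = bag_of_words(doc_b)
    # The shared vocabulary defines the dimensions of both vectors.
    vocabulary = set(counts_a) | set(counts_b)
    # A Counter returns 0 for a token it has not seen.
    return math.sqrt(sum((counts_a[t] - counts_b[t]) ** 2 for t in vocabulary))


if __name__ == "__main__":
    # Hypothetical example sentences, chosen only for illustration.
    print(euclidean_distance("the cat sat", "the cat sat"))         # identical: 0.0
    print(euclidean_distance("the cat sat", "the cat ran"))         # one word differs
    print(euclidean_distance("the cat sat", "stock markets fell"))  # no words shared
```

Identical documents give a distance of zero, and the value grows as the documents share fewer words, which matches the low-means-similar reading described above.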

While we mention documents in the example, we can technically compare any two elements in a set; this also means we can compare two sets ...
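As one illustration of comparing two sets directly, the sketch below uses the Jaccard distance, which is one minus the ratio of shared elements to total distinct elements. The choice of Jaccard distance and the function name jaccard_distance are assumptions made for this example; the text does not say which set-based measure it has in mind.

```python
def jaccard_distance(items_a, items_b):
    """Jaccard distance: 1 - |A intersect B| / |A union B|."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        # Convention assumed here: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


if __name__ == "__main__":
    # Hypothetical token sets, chosen only for illustration.
    print(jaccard_distance({"cat", "sat"}, {"cat", "sat"}))  # identical sets: 0.0
    print(jaccard_distance({"cat", "sat"}, {"dog", "ran"}))  # disjoint sets: 1.0
    print(jaccard_distance({"cat", "sat"}, {"cat", "ran"}))  # partial overlap
```

The result always lies between 0 and 1, so it can be compared across pairs of sets of different sizes.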

