Building Machine Learning Systems with Python - Third Edition
by Luis Pedro Coelho, Willi Richert, Matthieu Brucher
How not to do it
One text similarity measure is the Levenshtein distance, which also goes by the name edit distance. Let's say we have two words, machine and mchiene. The similarity between them can be expressed as the minimum set of edits that are necessary to turn one word into the other. In this case, the edit distance will be two, as we have to add an a after the m and delete the first e. This algorithm is, however, quite costly as it is bound by the length of the first word times the length of the second word.
Looking at our posts, we could cheat by treating whole words as characters and performing the edit distance calculation on the word level. Let's say we have two posts called, how to format my hard disk, and hard disk format problems ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access