Dimensionality reduction
What algorithms such as MinHash and LSH aim to do is reduce the quantity of data that must be stored without compromising on the essence of the original. They're a form of compression and they define helpful representations that preserve our ability to do useful work. In particular, MinHash and LSH are designed to work with data that can be represented as a set.
In fact, there is a whole class of dimensionality-reducing algorithms that will work with data that is not so easily represented as a set. We saw, in the previous chapter with k-means clustering, how certain data could be most usefully represented as a weighted vector. Common approaches to reduce the dimensions of data represented as vectors are principle component ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access