On the Efficient Determination of Most Near Neighbors, 2nd Edition by Mark S. Manasse

CHAPTER 2

Comparing Web Pages for Similarity: An Overview

Comparing the pages in a corpus raises five questions, each of which is taken up in the following sections.

1. Which features of a web page are to be compared?

2. Can we use numbers instead of strings to represent features?

3. What comparison or metric should we use to measure proximity of features?

4. Given that corpora typically contain billions of pages and petabytes of content, what should we do to reduce the features of a corpus to a manageable size?

5. Having chosen features, a metric, and a (probably lossy) compression scheme, how do we find most of the pairs of web pages which neighbor one another?

The next few sections provide brief introductions to each ...
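To make the first three questions concrete, here is a minimal sketch of one common set of answers, not necessarily the ones this chapter goes on to adopt. It assumes word shingles as the features, a 64-bit hash as the numeric representation, and Jaccard similarity as the measure of proximity. The function names (`shingles`, `fingerprint`, `page_features`, `jaccard`) and the shingle width `k` are hypothetical choices made for illustration only.

```python
import hashlib


def shingles(text, k=3):
    """Question 1: treat each run of k consecutive words as one feature."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full shingle: use the whole text as a single feature.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(feature):
    """Question 2: replace a string feature with a 64-bit integer."""
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def page_features(text, k=3):
    """Numeric feature set for a page (assumed: hashed word shingles)."""
    return {fingerprint(s) for s in shingles(text, k)}


def jaccard(a, b):
    """Question 3: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # Convention assumed here: two empty pages are identical.
    return len(a & b) / len(a | b)


page_a = "the quick brown fox jumps"
page_b = "the quick brown fox sleeps"
similarity = jaccard(page_features(page_a), page_features(page_b))
print(similarity)
```

The two sample pages share two of their four distinct shingles, so the similarity is 0.5 unless two shingles happen to collide under the hash. Questions 4 and 5 (compressing the feature sets and finding neighboring pairs without comparing every pair) are left out of this sketch.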
