Comparing pages in a corpus raises several questions:
1. Which features of a web page should be compared?
2. Can we use numbers instead of strings to represent features?
3. What metric should we use to measure the proximity of features?
4. Given that corpora typically contain billions of pages and petabytes of content, what should we do to reduce the features of a corpus to a manageable size?
5. Having chosen features, a metric, and a (probably lossy) compression scheme, how do we find most of the pairs of web pages that neighbor one another?
The next few sections provide brief introductions to each ...
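
As a rough preview of how the first three questions fit together, here is a minimal sketch. It rests on assumptions that the text above does not state: features are taken to be overlapping three-word windows ("shingles"), each shingle is turned into a number with CRC-32, and proximity is measured with Jaccard similarity. The function names and the two sample pages are hypothetical, and other choices are equally possible for each step.

```python
import zlib


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text.

    Assumption: word shingles are only one possible choice of feature.
    """
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def hashed_features(text, k=3):
    """Map each shingle to a 32-bit integer so features are numbers, not strings.

    Assumption: CRC-32 stands in for whatever hash a real system would use.
    """
    return {zlib.crc32(s.encode("utf-8")) for s in shingles(text, k)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|, defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two hypothetical pages that differ in a single word.
page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox leaps over the lazy dog"

similarity = jaccard(hashed_features(page_a), hashed_features(page_b))
print(f"similarity: {similarity:.2f}")
```

The sketch compares one pair of pages directly. It does nothing to shrink the feature sets or to avoid comparing every pair, which is what questions 4 and 5 are about.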