O'Reilly logo

Think Data Structures by Allen B. Downey

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Chapter 15. Crawling Wikipedia

In this chapter, I present a solution to the previous exercise and analyze the performance of web indexing algorithms. Then we build a simple web crawler.

The Redis-Backed Indexer

In my solution, we store two kinds of structures in Redis:

  • For each search term, we have a URLSet, which is a Redis set of URLs that contain the search term.

  • For each URL, we have a TermCounter, which is a Redis hash that maps each search term to the number of times it appears.

We discussed these data types in the previous chapter. You can also read about Redis structures at http://thinkdast.com/redistypes.

In JedisIndex, I provide a method that takes a search term and returns the Redis key of its URLSet:

private String urlSetKey(String term) {
    return "URLSet:" + term;
}

And a method that takes a URL and returns the Redis key of its TermCounter:

private String termCounterKey(String url) {
    return "TermCounter:" + url;
}

Here’s the implementation of indexPage, which takes a URL and a jsoup Elements object that contains the DOM tree of the paragraphs we want to index:

public void indexPage(String url, Elements paragraphs) {
    System.out.println("Indexing " + url);

    // make a TermCounter and count the terms in the paragraphs
    TermCounter tc = new TermCounter(url);
    tc.processElements(paragraphs);

    // push the contents of the TermCounter to Redis
    pushTermCounterToRedis(tc);
}

To index a page, we

  1. Make a Java TermCounter for the contents of the page, using code from a previous exercise.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required