Chapter 16. Boolean Search

In this chapter I present a solution to the previous exercise. Then you will write code to combine multiple search results and sort them by their relevance to the search terms.

Crawler Solution

First, let’s go over our solution to the previous exercise. I provided an outline of WikiCrawler; your job was to fill in crawl. As a reminder, here are the fields in the WikiCrawler class:

public class WikiCrawler {
    // keeps track of where we started
    private final String source;

    // the index where the results go
    private JedisIndex index;

    // queue of URLs to be indexed
    private Queue<String> queue = new LinkedList<String>();

    // fetcher used to get pages from Wikipedia
    final static WikiFetcher wf = new WikiFetcher();
}

When we create a WikiCrawler, we provide source and index. Initially, queue contains only one element, source.

Notice that the implementation of queue is a LinkedList, so we can add elements at the end—and remove them from the beginning—in constant time. By assigning a LinkedList object to a Queue variable, we limit ourselves to using methods in the Queue interface; specifically, we’ll use offer to add elements and poll to remove them.

Here’s my implementation of WikiCrawler.crawl:

 public String crawl(boolean testing) throws IOException { if (queue.isEmpty()) { return null; } String url = queue.poll(); System.out.println("Crawling " + url); if (testing==false && index.isIndexed(url)) { System.out.println("Already indexed."); return null; } Elements paragraphs; ...

Get Think Data Structures now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.