This chapter introduces the application we will develop during the rest of the book, a web search engine. I describe the elements of a search engine and introduce the first application, a web crawler that downloads and parses pages from Wikipedia. This chapter also presents a recursive implementation of depth-first search and an iterative implementation that uses a Java
Deque to implement a “last in, first out” stack.
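To make the stack idea concrete, here is a minimal sketch of the iterative depth-first search described above, using a Java ArrayDeque as a "last in, first out" stack. The tree representation (a Map from each node to its list of children) and the class name IterativeDfs are illustrative assumptions, not code from the book.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

public class IterativeDfs {
    // Returns the nodes in depth-first order, using a Deque as a LIFO stack
    // instead of the call stack used by the recursive version.
    static List<String> dfs(Map<String, List<String>> tree, String root) {
        List<String> visited = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            visited.add(node);
            List<String> children = tree.getOrDefault(node, List.of());
            // Push children in reverse order so the leftmost child
            // is on top of the stack and gets popped first.
            for (int i = children.size() - 1; i >= 0; i--) {
                stack.push(children.get(i));
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A small example tree: a has children b and c; b has children d and e.
        Map<String, List<String>> tree = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d", "e"));
        System.out.println(dfs(tree, "a")); // prints [a, b, d, e, c]
    }
}
```

Because the stack is explicit, this version avoids the risk of overflowing the call stack on deep trees, which is one reason crawlers usually prefer the iterative form.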
A web search engine, like Google Search or Bing, takes a set of search terms and returns a list of web pages that are relevant to those terms (I’ll discuss what “relevant” means later). You can read more at http://thinkdast.com/searcheng, but I’ll explain what you need as we go along.
The essential components of a search engine are:
Crawling: We’ll need a program that can download a web page, parse it, and extract the text and any links to other pages.
Indexing: We’ll need a data structure that makes it possible to look up a search term and find the pages that contain it.
Retrieval: We’ll need a way to collect results from the index and identify the pages that are most relevant to the search terms.
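To give a feel for the indexing component before we get there, here is a minimal sketch of an inverted index: a map from each search term to the set of pages that contain it. The class name TinyIndex and its methods are illustrative assumptions for this sketch, not the data structure we will actually build later.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TinyIndex {
    // Maps each term to the set of pages that contain it.
    private final Map<String, Set<String>> index = new HashMap<>();

    // Records that `term` appears on `page`; terms are normalized to lowercase.
    public void add(String term, String page) {
        index.computeIfAbsent(term.toLowerCase(), k -> new HashSet<>()).add(page);
    }

    // Looks up the pages that contain `term` (empty set if none).
    public Set<String> lookup(String term) {
        return index.getOrDefault(term.toLowerCase(), Set.of());
    }

    public static void main(String[] args) {
        TinyIndex idx = new TinyIndex();
        idx.add("java", "PageA");
        idx.add("java", "PageB");
        idx.add("deque", "PageA");
        System.out.println(idx.lookup("java")); // both pages that mention "java"
    }
}
```

The retrieval step then becomes a matter of combining the sets returned by lookup for each search term and ranking the results, which is the part where "relevance" comes in.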
We’ll start with the crawler. The goal of a crawler is to discover and download a set of web pages. For search engines like Google and Bing, the goal is to find all web pages, but often crawlers are limited to a smaller domain. In our case, we will only read pages from Wikipedia.
As a first step, we’ll build a crawler that reads a Wikipedia ...