Chapter 15. Designing a Web Crawler and Search Engine
Suppose you have planned a get-together with your loved ones during the holiday season. You love cooking and have decided to prepare all the food yourself, but you don't have recipes for the dishes you want to make. What is the best way to find them? You could ask your friends or page through cookbooks, but a simpler and more effective solution is a Google search. Google scans the breadth of the internet and returns the best results for how to prepare a specific dish. How does it sift through such a vast sea of information and surface the right answer? In this chapter, we'll find out by digging into the architecture of such search systems.
At a high level, the entire system consists of two subsystems: a web crawler and a search engine, as shown in Figure 15-1. A web crawler is software that systematically visits pages across the web and downloads their content. Because content on the internet is growing rapidly, the crawler must revisit pages regularly to keep its stored copy fresh. The search engine sits on top of the content accumulated by the crawler, indexing it so that user-searched keywords can be matched quickly and the most relevant results presented first.
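To make the division of labor concrete, here is a minimal, illustrative sketch of the two subsystems: a toy breadth-first crawler and a keyword search backed by an inverted index. The function names, the page limit, and the seed URL are assumptions made for the example; a real system would add robots.txt handling, deduplication, distributed storage, and ranking, none of which is shown here.

```python
# Toy crawler + search engine sketch. This is an illustration of the two
# subsystems, not a production design.
import re
from collections import defaultdict, deque
from urllib.parse import urljoin
from urllib.request import urlopen


def crawl(seed_url, max_pages=10):
    """Breadth-first crawl starting from seed_url; returns {url: raw_html}."""
    seen, pages = {seed_url}, {}
    frontier = deque([seed_url])
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip unreachable or non-text pages
        pages[url] = html
        # Naive link extraction; a real crawler would use an HTML parser.
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


def build_index(pages):
    """Inverted index: term -> set of URLs whose text contains that term."""
    index = defaultdict(set)
    for url, html in pages.items():
        text = re.sub(r"<[^>]+>", " ", html)  # crudely strip HTML tags
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(url)
    return index


def search(index, query):
    """Return URLs containing every query term (boolean AND retrieval)."""
    hits = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*hits) if hits else set()


if __name__ == "__main__":
    pages = crawl("https://example.com", max_pages=5)  # placeholder seed URL
    index = build_index(pages)
    print(search(index, "example domain"))
```

Even in this toy version, the separation is visible: the crawler only discovers and fetches pages, while the search engine only indexes and queries what the crawler has stored. The chapters that follow scale each half up independently.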
With this basic understanding, let’s start by gathering the functional and nonfunctional requirements of the proposed system.