CHAPTER 11

The Web and Search

11.1 The World Wide Web

11.2 Python WWW API

11.3 String Pattern Matching

11.4 Case Study: Web Crawler

Chapter Summary

Solutions to Practice Problems

Exercises

Problems

IN THIS CHAPTER, we introduce the World Wide Web (the WWW or simply the web). The web is one of the most important developments in computer science. It has become the platform of choice for sharing information and communicating. Consequently, the web is a rich source for cutting-edge application development.

We start this chapter by describing the three core WWW technologies: Uniform Resource Locators (URLs), the HyperText Transfer Protocol (HTTP), and the HyperText Markup Language (HTML). We focus especially on HTML, the language of web pages. We then go over the Standard Library modules that enable developers to write programs that access, download, and process documents on the web. We focus, in particular, on mastering tools such as HTML parsers and regular expressions that help us process web pages and analyze the content of text documents.
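As a rough preview of the kind of processing described above, the following sketch uses the Standard Library's `html.parser`, `urllib.parse`, and `re` modules to pull hyperlinks and an email address out of an HTML string. The class name `LinkCollector`, the sample page, and the `example.com` URLs are assumptions made for illustration; they are not taken from the chapter. A real program would typically download the page first, for instance with `urllib.request.urlopen()`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import re


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every anchor (<a href=...>) in a page.

    Hypothetical helper class, written for this illustration only.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url   # URL of the page being parsed
        self.links = []            # absolute URLs found so far

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    # resolve relative links against the page URL
                    self.links.append(urljoin(self.base_url, value))


# Made-up page content standing in for a downloaded document
SAMPLE_HTML = (
    '<html><body><h1>Contact</h1>'
    '<p>Write to info@example.com or see '
    '<a href="about.html">About</a> and '
    '<a href="http://example.org/">a partner site</a>.</p>'
    '</body></html>'
)

collector = LinkCollector('http://example.com/index.html')
collector.feed(SAMPLE_HTML)
links = collector.links

# A deliberately simple pattern; real email addresses are more varied
emails = re.findall(r'\w+@\w+\.\w+', SAMPLE_HTML)

print(links)
print(emails)
```

The parser handles the structure of the page (tags and attributes), while the regular expression searches the raw text for a pattern. The two tools are complementary rather than interchangeable.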

In this chapter's case study, we develop a web crawler, that is, a program that “crawls through the web.” Our crawler analyzes the content of each visited web page and works by calling itself recursively on every link out of the web page. The crawler is the first step in the development of a search engine, which we develop in Chapter 12.
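The recursive idea can be sketched without any network access. The example below is only an illustration under stated assumptions, not the crawler developed in the case study: the dictionary `PAGES` is a made-up three-page "web" that stands in for real downloads, and the names `LinkCollector` and `crawl` are hypothetical. The list of visited URLs is an addition that keeps the recursion from looping forever when pages link back to each other.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML source (no real HTTP requests)
PAGES = {
    'http://example.com/one.html':
        '<a href="two.html">Two</a> <a href="three.html">Three</a>',
    'http://example.com/two.html':
        '<a href="one.html">One</a>',
    'http://example.com/three.html':
        '<a href="two.html">Two</a>',
}


class LinkCollector(HTMLParser):
    """Collect the absolute URLs of all hyperlinks in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(url, visited=None):
    """Visit url, then recursively visit every page it links to.

    Returns the list of URLs in the order they were first visited.
    """
    if visited is None:
        visited = []
    # Stop on pages already seen or not present in the toy web
    if url in visited or url not in PAGES:
        return visited
    visited.append(url)

    # "Analyze" the page: here we only extract its outgoing links
    collector = LinkCollector(url)
    collector.feed(PAGES[url])

    # Recursive step: crawl every link out of this page
    for link in collector.links:
        crawl(link, visited)
    return visited


order = crawl('http://example.com/one.html')
print(order)
```

Starting from `one.html`, the sketch visits `two.html` and then `three.html`, and it skips the links that lead back to pages it has already seen.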

11.1 The World Wide Web

The World Wide Web (WWW or, simply, the web) is a distributed system of documents linked through ...

