IN THIS CHAPTER, we introduce the World Wide Web (the WWW or simply the web). The web is one of the most important developments in computer science. It has become the platform of choice for sharing information and communicating. Consequently, the web is a rich source for cutting-edge application development.
We start this chapter by describing the three core WWW technologies: Uniform Resource Locators (URLs), the HyperText Transfer Protocol (HTTP), and the HyperText Markup Language (HTML). We focus especially on HTML, the language of web pages. We then go over the Standard Library modules that enable developers to write programs that access, download, and process documents on the web. In particular, we concentrate on tools such as HTML parsers and regular expressions, which help us process web pages and analyze the content of text documents.
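As a small preview of the tools mentioned above, the following sketch splits a URL into its parts, uses an HTML parser to pull the text out of a web page, and applies a regular expression to find the words in that text. The class name `TextCollector`, the URL, and the page content are hypothetical and chosen only for illustration; the sketch assumes Python and its Standard Library modules `urllib.parse`, `html.parser`, and `re`.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class TextCollector(HTMLParser):
    """Collect the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text outside a tag.
        self.chunks.append(data)


# Hypothetical URL and page content, used only for illustration.
url = 'http://www.example.com/index.html'
page = '<html><body><h1>Web Basics</h1><p>URLs, HTTP, and HTML.</p></body></html>'

parts = urlparse(url)                    # scheme, host, and path of the URL
collector = TextCollector()
collector.feed(page)                     # parse the HTML and gather its text
text = ' '.join(collector.chunks)
words = re.findall(r'[A-Za-z]+', text)   # regular expression: runs of letters

print(parts.scheme, parts.netloc, parts.path)
print(words)
```

No network access is needed here because the page is given as a string; downloading a real page is a separate step.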
In this chapter's case study, we develop a web crawler, that is, a program that “crawls through the web.” Our crawler analyzes the content of each web page it visits and then calls itself recursively on every link out of that page. The crawler is the first step in the development of a search engine, a task we take up in Chapter 12.
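The recursive idea behind the crawler can be sketched in a few lines. This is a minimal illustration and not the crawler developed in the case study: the names `LinkCollector`, `crawl`, and `fetch` are hypothetical, the `visited` set is an added assumption that keeps the recursion from revisiting pages, and the three-page "web" is an in-memory dictionary so that the example runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every hyperlink in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Anchor tags carry their target in the href attribute.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def crawl(url, fetch, visited=None):
    """Visit url, then recursively visit every link found on that page.

    fetch is any function that maps a URL to the HTML text of the page
    (an assumption of this sketch, so the example can run offline).
    """
    if visited is None:
        visited = set()
    if url in visited:          # assumption: skip pages already seen
        return visited
    visited.add(url)
    try:
        html = fetch(url)
    except Exception:           # page missing or unreadable: nothing to follow
        return visited
    collector = LinkCollector(url)
    collector.feed(html)
    for link in collector.links:
        crawl(link, fetch, visited)     # the recursive step
    return visited


# Hypothetical three-page "web", used only for illustration.
pages = {
    'http://example.com/one.html': '<a href="two.html">Two</a>',
    'http://example.com/two.html':
        '<a href="three.html">Three</a> <a href="one.html">One</a>',
    'http://example.com/three.html': '<p>No links here.</p>',
}

visited = crawl('http://example.com/one.html', pages.__getitem__)
print(sorted(visited))
```

To crawl real pages, `fetch` could instead be a function built on `urllib.request.urlopen` that downloads and decodes each page; passing the fetching function as an argument is a design choice of this sketch that keeps the recursion separate from the downloading.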
The World Wide Web (WWW or, simply, the web) is a distributed system of documents linked through ...