Chapter 22. Structured Text: HTML

Most documents on the web use HTML, the HyperText Markup Language. Markup is the insertion of special tokens, known as tags, in a text document, to structure the text. HTML is, in theory, an application of the large, general standard known as SGML, the Standard Generalized Markup Language. In practice, many documents on the web use HTML in sloppy or incorrect ways.

HTML was designed for presenting documents in a browser. As web content evolved, users realized it lacked the capability for semantic markup, in which the markup indicates the meaning of the delineated text rather than simply its appearance. Complete, precise extraction of the information in an HTML document often turns out to be unfeasible. A more rigorous standard called XHTML attempted to remedy these shortcomings. XHTML is similar to traditional HTML, but it is defined in terms of XML, the eXtensible Markup Language, and is specified more precisely than HTML. You can handle well-formed XHTML with the tools covered in Chapter 23. However, as of this writing, XHTML has not enjoyed overwhelming success, getting scooped instead by the more pragmatic HTML5.
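As a rough illustration of the difference between well-formed XHTML and sloppy HTML, the following sketch parses a tiny XHTML document with a generic XML parser. It assumes the standard library's xml.etree.ElementTree module (this excerpt does not say which tools Chapter 23 covers); the sample documents and the helper name is_well_formed are made up for the example.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this XML namespace, so ElementTree reports
# their tags in the form "{namespace}name".
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

xhtml_doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Hi</title></head>"
    "<body><p>One</p><p>Two</p></body>"
    "</html>"
)

# Well-formed XHTML parses like any other XML document.
root = ET.fromstring(xhtml_doc)
title = root.find(f"{XHTML_NS}head/{XHTML_NS}title").text
paragraphs = [p.text for p in root.iter(f"{XHTML_NS}p")]


def is_well_formed(text):
    """Return True if text parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Typical "tag soup": unclosed <p> and <br> tags are common in HTML
# found on the web, but an XML parser rejects them.
sloppy_html = "<p>One<br><p>Two"
```

Here is_well_formed(xhtml_doc) is true while is_well_formed(sloppy_html) is false, which is why XML tools suit XHTML but not the HTML commonly found on the web.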

Despite the difficulties, it’s often possible to extract at least some useful information from HTML documents (a task known as web scraping, spidering, or just scraping). Python’s standard library tries to help, supplying the html package for the task of parsing HTML documents, whether for the purpose of presenting the documents or, more typically, ...
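As a minimal sketch of the kind of scraping the html package supports, the example below subclasses html.parser.HTMLParser to collect link targets. The class name LinkCollector, the function extract_links, and the sample markup are hypothetical names chosen for illustration; only HTMLParser and its feed, close, and handle_starttag methods come from the standard library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling
        # this method; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Deliberately sloppy input: mixed-case names and unclosed <p> tags.
sample = '<p>See <a href="https://example.com/a">A</a> and <A HREF="/b">B</A><p>unclosed'
links = extract_links(sample)
print(links)
```

The parser tolerates the unclosed tags in the sample rather than raising an error, which is the behavior you usually want when scraping real-world pages.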

Get Python in a Nutshell, 4th Edition now with the O’Reilly learning platform.