Python in a Nutshell

by Alex Martelli
March 2003
Intermediate to advanced
656 pages
English
O'Reilly Media, Inc.

Chapter 22. Structured Text: HTML

Most documents on the Web use HTML, the HyperText Markup Language. Markup is the insertion of special tokens, known as tags, in a text document to give structure to the text. HTML is an application of the large, general standard known as SGML, the Standard Generalized Markup Language. In practice, many of the Web’s documents use HTML in sloppy or incorrect ways. Browsers have evolved many practical heuristics over the years to try to compensate for this, but even so, a browser still often displays an incorrect web page in some weird way.

Moreover, HTML was never suitable for much more than presenting documents on a screen. Complete and precise extraction of the information in the document, working backward from the document’s presentation, is often infeasible. To tighten things up again, HTML has evolved into a more rigorous standard called XHTML. XHTML is very similar to traditional HTML, but it is defined in terms of XML, and therefore more precisely than traditional HTML. You can handle XHTML with the tools covered in Chapter 23.
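Because XHTML is defined in terms of XML, a well-formed XHTML document can be read by a general-purpose XML parser. The following is a minimal sketch of that idea, assuming Python 3 and the standard library's xml.dom.minidom module; the sample document and the helper names paragraph_texts and is_well_formed are hypothetical, chosen only for illustration, and are not necessarily the tools Chapter 23 presents.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# A tiny, well-formed XHTML document (hypothetical sample data).
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Example</title></head>'
    '<body><p>First</p><p>Second<br/></p></body>'
    '</html>'
)


def paragraph_texts(xhtml_source):
    """Return the direct text content of each <p> element."""
    doc = minidom.parseString(xhtml_source)
    texts = []
    for p in doc.getElementsByTagName('p'):
        # Collect only the text nodes that are direct children of <p>.
        parts = [n.data for n in p.childNodes if n.nodeType == n.TEXT_NODE]
        texts.append(''.join(parts))
    return texts


def is_well_formed(source):
    """Report whether the XML parser accepts the source at all."""
    try:
        minidom.parseString(source)
    except ExpatError:
        return False
    return True
```

Calling paragraph_texts(XHTML) yields one string per paragraph. Passing traditional, sloppy HTML such as '<p>text<br></p>' to is_well_formed returns False, because the unclosed br tag is not well-formed XML. An XML parser rejects such input outright rather than guessing at the author's intent, as a browser would.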

Despite the difficulties, it’s often possible to extract at least some useful information from HTML documents. Python supplies the sgmllib, htmllib, and HTMLParser modules for parsing HTML documents, whether to present the documents or, more typically, to extract information from them. Generating HTML and embedding Python in HTML are also frequent tasks. ...
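As an illustration of extracting information from imperfect HTML, here is a minimal sketch that collects link targets. It assumes Python 3, where the HTMLParser module named above is available as html.parser (sgmllib and htmllib are not part of Python 3). The class name LinkCollector, the function extract_links, and the sample markup are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lowercase.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return all href values found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Sloppy markup: mixed-case tags, an unclosed <a> and <p>, an unquoted value.
SLOPPY = '<P>See <A HREF="a.html">one<p>and <a href=b.html>two</a>'
```

Calling extract_links(SLOPPY) returns both link targets even though the markup is not well formed. The parser reports tags as it encounters them and does not check that they nest or close correctly, which is why this event-driven style suits the sloppy documents described earlier.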



ISBN: 0596001886