Chapter 5. File Formats
Overview
This chapter describes several modules that parse different file formats.
Markup Languages
Python comes with extensive support for the Extensible Markup Language (XML) and Hypertext Markup Language (HTML) file formats. Python also provides basic support for Standard Generalized Markup Language (SGML).
All these formats share the same basic structure because both HTML and XML are derived from SGML. Each document contains a mix of start tags, end tags, plain text (also called character data), and entity references, as shown in the following:
<document name="sample.xml">
    <header>This is a header</header>
    <body>This is the body text. The text can contain
    plain text (&quot;character data&quot;), tags, and
    entities.
    </body>
</document>
In the previous example, <document>,
<header>, and <body>
are start tags. For each start tag, there’s a corresponding end tag
that looks similar, but has a slash before the tag name. The start
tag can also contain one or more attributes, like
the name attribute in this example.
Everything between a start tag and its matching end tag is called an
element. In the previous example, the
document element contains two other elements:
header and body.
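The relationship between elements, tags, and attributes can be seen by loading the sample document into a tree. The following is a minimal sketch that assumes the standard xml.etree.ElementTree module, which the text above does not name; the helper name describe is hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET

# The sample document from the text, held in a string.
SAMPLE = """<document name="sample.xml">
    <header>This is a header</header>
    <body>This is the body text. The text can contain
    plain text (&quot;character data&quot;), tags, and
    entities.
    </body>
</document>"""


def describe(source):
    """Return the root tag, its attributes, and its child element tags.

    Hypothetical helper, not part of any library.
    """
    root = ET.fromstring(source)
    return {
        "tag": root.tag,                           # name of the outermost element
        "attributes": dict(root.attrib),           # attributes from the start tag
        "children": [child.tag for child in root], # elements nested directly inside
    }


summary = describe(SAMPLE)
print(summary)
```

The document element is reported with its name attribute and its two child elements, header and body, in document order.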
Finally, &quot; is a character entity. Entities are
used to represent reserved characters in the text sections. In this
case, the entity stands for a double quotation mark ("). Every entity
begins with an ampersand (&) and ends with a semicolon.
Another common entity is &lt; for “less than”
(<).
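To see how reserved characters map to entities and back, here is a small sketch that assumes the standard html module, chosen only for illustration (xml.sax.saxutils offers comparable escape and unescape functions). The variable names are hypothetical.

```python
import html

# Text containing reserved characters: quotes, an ampersand, angle brackets.
raw = '("character data") & <tags>'

# Replace each reserved character with its entity; quote=True also
# converts double quotation marks.
escaped = html.escape(raw, quote=True)

# Convert the entities back into the characters they stand for.
restored = html.unescape(escaped)

print(escaped)
print(restored == raw)
```

The ampersand is escaped too, because it is the character that introduces every entity.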