Chapter 8. Structured Text
In Chapter 6, we took a very brief look at the csv module, which is used to read and write lines of tab- or comma-separated values, with each line corresponding to one item in the file. We’ve also looked at a variety of ways to scan files for certain patterns of data, including str methods and regular expressions. Files in tab- or comma-separated values format, FASTA files, GenBank files, and many other file formats encountered in bioinformatics work are called flat files.[35] What is “flat” about them is that they are just text files: the data has no explicit structure beyond agreed-on conventions regarding special characters, blank lines, whitespace, and so on. They can have introductory material before the data, other material after the data, several sets of data in one file, and so forth.
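As a reminder of what flat-file processing looks like, here is a minimal sketch that reads tab-separated text with the csv module. The sample data, the column names, and the read_flat helper are made up for illustration; the point is only that the meaning of each field comes from convention, not from anything in the file itself.

```python
import csv
import io

# Hypothetical tab-separated data: one record per line. Nothing in the
# text says what the columns mean; that is known only by convention.
flat_text = "gene\tlength\ngeneA\t1200\ngeneB\t850\n"

def read_flat(text):
    """Return a list of rows, each a list of the strings found on one line."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))

rows = read_flat(flat_text)
header, records = rows[0], rows[1:]
for name, length in records:
    # Every field arrives as a string; converting it is up to the reader.
    print(name, int(length))
```

The file is handled strictly line by line, which is all a flat format requires.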
The opposite of “flat” in this context is structured. A structured text file contains elements, each of which can have attributes and/or “sub” or child elements. There can be different kinds of elements, and in general there are rules specifying what attributes and children each kind of element can have. The linear approaches for processing text files that we’ve seen so far are inadequate for structured files, essentially because the files are two-dimensional: elements nest inside other elements, so the content forms a hierarchy rather than a simple sequence of lines. This chapter describes some ways to process structured files.
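To make the idea of elements, attributes, and children concrete, here is a small sketch that uses the standard library's xml.etree.ElementTree to walk a made-up document. The element names, attribute names, and the outline function are all invented for illustration and do not correspond to any real file format. This is just one way to see the nesting, not necessarily the approach taken later in the chapter.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured text: the element and attribute names are
# invented for illustration only.
structured_text = """<genome name="example">
  <gene id="g1">
    <exon start="10" end="50"/>
    <exon start="80" end="120"/>
  </gene>
  <gene id="g2">
    <exon start="200" end="260"/>
  </gene>
</genome>"""

def outline(element, depth=0):
    """Return one indented line per element, visiting children recursively."""
    lines = ["  " * depth + element.tag + " " + str(dict(element.attrib))]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(structured_text)
print("\n".join(outline(root)))
```

Note that the function has to call itself for each child element; a single loop over the lines of the file would not recover which exon belongs to which gene.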
HTML
An obvious example of a structured file format is basic HTML. (We’ll ignore all the fancy stuff like JavaScript, frames, and so on.) ...