Chapter 8. Structured Text
In Chapter 6, we took a very brief look at the
csv
module that is used to read and write
lines of tab- or comma-separated values, with each line corresponding to one
item in the file. We’ve also looked at a variety of ways to scan files
looking for certain patterns of data, including using str
methods and regular expressions. Files that
are in tab- or comma-separated values format, FASTA files, GenBank files,
and many other file formats encountered in bioinformatics work are
called flat files.[35] What is “flat” about them is that they are just text files:
the data has no explicit structure beyond agreed-on conventions regarding special characters,
blank lines, whitespace, etc. They can have introductory material before the data, other
material after the data, several sets of data in one file, and so on.
The opposite of “flat” in this context is structured. A structured text file contains elements, each of which can have attributes and/or “sub” or child elements. There can be different kinds of elements, and in general there are rules specifying what attributes and children each kind of element can have. The linear approaches for processing text files that we’ve seen so far are inadequate for structured files, essentially because the files are two-dimensional. This chapter describes some ways to process structured files.
HTML
An obvious example of a structured file format is basic HTML. (We’ll ignore all the fancy stuff like JavaScript, frames, and so on.) ...
Get Bioinformatics Programming Using Python now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.