This chapter will take you step by step through the process of marking up plain-text documents with HTML5 using regular expressions, concluding what we started early in the book.
Now, if it were me, I’d use AsciiDoc to do this work. But for our purposes here, we’ll pretend that there is no such thing as AsciiDoc (what a shame). We’ll plod along using a few tools we have at hand—namely, sed and Perl—and our own ingenuity.
For our text we’ll still use Coleridge’s poem in rime.txt.
The scripts in this chapter work well with rime.txt because you understand the structure of that file. These scripts will give you less predictable results when used on arbitrary text files; however, they give you a starting point for handling text structures in more complex files.
Before we start adding markup to the poem, let’s talk about how to match either HTML or XML tags. There are a variety of ways to match a tag, either start-tags (e.g., <html>) or end-tags (e.g., </html>), but I have found the one that follows to be reliable. It will match start-tags, with or without attributes.
Here is what it does:
The first character is a left angle bracket (<).
Element names can begin with an underscore character (_) in XML, or with a letter in the ASCII range, in either upper- or lowercase (see Technical Notes).
After that start character, zero or more characters can follow, each of them any character other than a right angle bracket (>).
The expression ...
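To make the description above concrete, here is a minimal sketch of a pattern built from those pieces. It is a reconstruction, not necessarily the exact expression this chapter has in mind: the closing right angle bracket is an assumption, since the last item above is cut off. I use grep -o here only because it prints each match on its own line; the helper name find_start_tags and the sample input are made up for illustration.

```shell
#!/bin/sh
# Assumed reconstruction of the pattern described in the list above:
#   <          a left angle bracket
#   [_a-zA-Z]  an underscore or an ASCII letter starts the name
#   [^>]*      zero or more characters other than a right angle bracket
#   >          a closing right angle bracket (assumed; the last item is truncated)
TAG_PATTERN='<[_a-zA-Z][^>]*>'

# find_start_tags is a hypothetical helper: it reads text on standard input
# and prints every match of TAG_PATTERN on its own line.
find_start_tags() {
    grep -oE "$TAG_PATTERN"
}

# Sample input (invented for this sketch): three start-tags and three end-tags.
printf '%s\n' '<html lang="en"><body><p>It is an ancient Mariner,</p></body></html>' |
    find_start_tags
```

Run against the sample line, this should print the three start-tags (including the one with an attribute) and skip the end-tags, because the slash in </p> is neither an underscore nor a letter.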