Chapter 9. Marking Up a Document with HTML

This chapter will take you step by step through the process of marking up plain-text documents with HTML5 using regular expressions, concluding what we started early in the book.

Now, if it were me, I’d use AsciiDoc to do this work. But for our purposes here, we’ll pretend that there is no such thing as AsciiDoc (what a shame). We’ll plod along using a few tools we have at hand—namely, sed and Perl—and our own ingenuity.

For our text we’ll still use Coleridge’s poem in rime.txt.

Note

The scripts in this chapter work well with rime.txt because you understand the structure of that file. These scripts will give you less predictable results when used on arbitrary text files; however, they give you a starting point for handling text structures in more complex files.

Matching Tags

Before we start adding markup to the poem, let’s talk about how to match HTML or XML tags. There are a variety of ways to match a tag, whether a start-tag (e.g., <html>) or an end-tag (e.g., </html>), but I have found the pattern that follows to be reliable. It matches start-tags, with or without attributes:

<[_a-zA-Z][^>]*>

Here is what it does:

  • The first character is a left angle bracket (<).

  • The element name begins with either an underscore character (_), which XML allows, or a letter in the ASCII range, in either upper- or lowercase (see Technical Notes).

  • After the start character come zero or more characters, each of which can be any character other than a right angle bracket (>). This covers the rest of the name as well as any attributes.

  • The expression ...
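
As a rough illustration of the points above, the following Python sketch tries the pattern against a handful of strings. The sample strings and the helper name find_start_tags are invented for this example; they are not part of the chapter's sed or Perl scripts.

```python
import re

# The start-tag pattern from the text, compiled once for reuse.
START_TAG = re.compile(r"<[_a-zA-Z][^>]*>")

# Hypothetical sample strings, chosen to exercise each part of the pattern.
samples = [
    "<html>",             # plain start-tag
    '<p class="verse">',  # start-tag with an attribute
    "<_private>",         # XML-style name beginning with an underscore
    "</html>",            # end-tag: "/" is not an allowed first character
    "<1abc>",             # a digit is not an allowed first character
]

# Keep only the strings that the pattern matches in full.
matches = [s for s in samples if START_TAG.fullmatch(s)]
print(matches)


def find_start_tags(text):
    """Return every substring of text that the start-tag pattern matches."""
    return START_TAG.findall(text)


print(find_start_tags("<h1>The Rime</h1>"))
```

Because [^>]* stops at the first right angle bracket it meets, an attribute value that itself contains > would cut the match short. For text whose structure you know, that is usually an acceptable trade-off.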
