9.5. Convert Plain Text to HTML by Adding <p> and <br> Tags
Problem
Given a plain text string, such as a multiline value
submitted via a form, you want to convert it to an HTML fragment to
display within a web page. Paragraphs, separated by two line breaks in a
row, should be surrounded with <p>⋯</p>. Additional
line breaks should be replaced with <br> tags.
Solution
This problem can be solved in four simple steps. In most programming languages, only the middle two steps benefit from regular expressions.
Step 1: Replace HTML special characters with named character references
As we’re converting plain text to HTML, the first step
is to convert the three special HTML characters &, <, and > to named character references (see
Table 9-3).
Otherwise, a web browser may interpret parts of the text as markup or
character references instead of displaying them literally.
Table 9-3. HTML special character substitutions
Search for | Replace with |
|---|---|
& | &amp; |
< | &lt; |
> | &gt; |
Ampersands (&) must be
replaced first, since you’ll be adding more ampersands to the subject
string as part of the named character references.
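The ordering rule above can be illustrated with a minimal Python sketch. The function name escape_html is hypothetical, chosen only for this example; plain string replacement is enough here, so no regular expression is involved.

```python
def escape_html(text: str) -> str:
    """Replace the three HTML special characters with named character references."""
    # Ampersands go first: the next two replacements introduce new
    # ampersands (in &lt; and &gt;) that must not be escaped again.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


print(escape_html("a < b && c > d"))  # prints: a &lt; b &amp;&amp; c &gt; d
```

If the ampersand replacement ran last instead, the output of the other two replacements would be escaped a second time, turning &lt; into &amp;lt;.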
Step 2: Replace all line breaks with <br>
Search for:
\r\n?|\n
| Regex options: None |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
\R
| Regex options: None |
| Regex flavors: PCRE 7, Perl 5.10 |
Replace with:
<br>
| Replacement text flavors: .NET, Java, JavaScript, Perl, PHP, Python, Ruby |
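As a rough sketch of this step in Python, the first regex above can be applied with re.sub. The names LINE_BREAK and breaks_to_br are hypothetical and exist only for this illustration.

```python
import re

# Matches a Windows (\r\n), legacy Mac (\r), or Unix (\n) line break.
# The optional \n after \r keeps a CRLF pair from becoming two <br> tags.
LINE_BREAK = re.compile(r"\r\n?|\n")


def breaks_to_br(text: str) -> str:
    """Replace every line break in text with a <br> tag."""
    return LINE_BREAK.sub("<br>", text)


print(breaks_to_br("one\r\ntwo\rthree\nfour"))  # prints: one<br>two<br>three<br>four
```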
Step 3: Replace double <br> tags with </p><p>
Search for:
<br>\s*<br>
| Regex options: None |
| Regex flavors: .NET, Java, JavaScript, PCRE, ... |
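A minimal Python sketch of this substitution follows. It assumes the replacement text is </p><p>, as stated in the step heading; the names DOUBLE_BR and mark_paragraph_breaks are hypothetical.

```python
import re

# Two <br> tags in a row, optionally separated by whitespace,
# mark the boundary between two paragraphs.
DOUBLE_BR = re.compile(r"<br>\s*<br>")


def mark_paragraph_breaks(html: str) -> str:
    """Replace each pair of <br> tags with a paragraph boundary."""
    # Replacement text assumed from the step heading: </p><p>
    return DOUBLE_BR.sub("</p><p>", html)


print(mark_paragraph_breaks("one<br><br>two<br>three"))  # prints: one</p><p>two<br>three
```

Note that this sketch only inserts the boundaries between paragraphs; it does not add the opening <p> and closing </p> around the whole string.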