8.5. Convert Plain Text to HTML by Adding <p> and <br> Tags
Problem
Given a plain text string, such as a multiline value submitted via a form, you want
to convert it to an HTML fragment to display within a web page.
Paragraphs, separated by two line breaks in a row, should be
surrounded with <p>⋯</p>. Additional
line breaks should be replaced with <br> tags.
Solution
This problem can be solved in four simple steps. In most programming languages, only the middle two steps benefit from regular expressions.
Step 1: Replace HTML special characters with character entity references
As we’re converting plain text to HTML, the first step is to
convert the three special HTML characters &, <, and > to character entity references (see
Table 8-3).
Otherwise, a web browser could interpret those characters in the text as
markup or entities, and the page would not display the text as it was written.
Table 8-3. HTML special character substitutions
| Search for | Replace with |
|---|---|
| & | &amp; |
| < | &lt; |
| > | &gt; |
Ampersands (&) must be
replaced first, since you’ll be adding additional ampersands to the
subject string as part of the character entity references.
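As a minimal sketch of this step in Python, the three substitutions can be done with plain string replacement, with no regular expression needed. The function name `escape_html` is a hypothetical one chosen for illustration.

```python
def escape_html(text: str) -> str:
    """Replace the three HTML special characters with entity references."""
    # Ampersands go first: the later replacements insert new ampersands
    # (in &lt; and &gt;) that must not be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


print(escape_html("if (a < b && b > c)"))
# if (a &lt; b &amp;&amp; b &gt; c)
```

If the ampersand replacement ran last instead, a `<` would first become `&lt;` and then `&amp;lt;`, which a browser would display as the literal text `&lt;`.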
Step 2: Replace all line breaks with <br>
Search for:
\r\n?|\n
| Regex options: None |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
\R
| Regex options: None |
| Regex flavors: PCRE 7, Perl 5.10 |
Replace with:
<br>
| Replacement text flavors: .NET, Java, JavaScript, Perl, PHP, Python, Ruby |
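The first regex above could be applied in Python roughly as follows. This is a sketch, not a definitive implementation. The names `LINE_BREAK` and `breaks_to_br` are hypothetical. Python's `re` module does not support `\R`, so only the `\r\n?|\n` form is used here.

```python
import re

# Matches a Windows (\r\n), legacy Mac (\r), or Unix (\n) line break.
# \r\n? is greedy, so a \r\n pair is consumed as one match, not two.
LINE_BREAK = re.compile(r"\r\n?|\n")


def breaks_to_br(text: str) -> str:
    """Replace every line break with a <br> tag."""
    return LINE_BREAK.sub("<br>", text)


print(breaks_to_br("one\r\ntwo\rthree\nfour"))
# one<br>two<br>three<br>four
```

Because each line break becomes exactly one `<br>`, a blank line between paragraphs ends up as two consecutive `<br>` tags, which is what the next step looks for.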
Step 3: Replace double <br> tags with </p><p>
Search for:
<br>\s*<br>
| Regex options: None |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, ... |
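A hedged Python sketch of this search follows, assuming the replacement text is the `</p><p>` named in the step heading. The names `DOUBLE_BR` and `mark_paragraphs` are hypothetical.

```python
import re

# Two <br> tags, optionally separated by whitespace such as spaces or tabs
# left over from a line that contained only whitespace.
DOUBLE_BR = re.compile(r"<br>\s*<br>")


def mark_paragraphs(text: str) -> str:
    """Turn each pair of <br> tags into a paragraph boundary.

    The replacement text is assumed from the step heading.
    """
    return DOUBLE_BR.sub("</p><p>", text)


print(mark_paragraphs("First paragraph<br><br>Second paragraph<br>same paragraph"))
# First paragraph</p><p>Second paragraph<br>same paragraph
```

The single `<br>` inside the second paragraph is left alone, and the result still lacks the outermost `<p>` and `</p>` that the Problem statement calls for.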