9.5. Convert Plain Text to HTML by Adding <p> and <br> Tags
Problem
Given a plain text string, such as a multiline value
submitted via a form, you want to convert it to an HTML fragment to
display within a web page. Paragraphs, separated by two line breaks in a
row, should be surrounded with <p>⋯</p>. Additional
line breaks should be replaced with <br> tags.
Solution
This problem can be solved in four simple steps. In most programming languages, only the middle two steps benefit from regular expressions.
Step 1: Replace HTML special characters with named character references
As we’re converting plain text to HTML, the first step
is to convert the three special HTML characters &, <, and >
to named character references (see Table 9-3).
Otherwise, the resulting markup could lead to unintended results when
displayed in a web browser.
Table 9-3. HTML special character substitutions
Search for | Replace with |
---|---|
& | &amp; |
< | &lt; |
> | &gt; |
Ampersands (&) must be
replaced first, since you’ll be adding more ampersands to the subject
string as part of the named character references.
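As a minimal sketch of Step 1, assuming Python and plain string replacement (no regular expression is needed here), the three substitutions can be chained with the ampersand first. The function name escape_html is hypothetical and chosen only for illustration.

```python
def escape_html(text: str) -> str:
    """Replace the three HTML special characters with named character references."""
    # Ampersands go first. If "<" were replaced first, the "&" inside the
    # newly added "&lt;" would itself be escaped to "&amp;lt;".
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


print(escape_html("1 < 2 & 3 > 2"))  # 1 &lt; 2 &amp; 3 &gt; 2
```

Python's standard library offers html.escape(text, quote=False), which performs the same three substitutions; the explicit version is shown only to make the ordering visible.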
Step 2: Replace all line breaks with <br>
Search for:
\r\n?|\n
Regex options: None |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
\R
Regex options: None |
Regex flavors: PCRE 7, Perl 5.10 |
Replace with:
<br>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP, Python, Ruby |
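Step 2 might look like the following in Python. This is a sketch under stated assumptions: it uses the first regex, which the listing above marks as supported in Python, and the names LINE_BREAK and breaks_to_br are hypothetical. Python's re module does not support \R, so the second regex is not used.

```python
import re

# Matches a Windows (\r\n), legacy Mac (\r), or Unix (\n) line break.
# "\r\n?" is tried first, so a \r\n pair is consumed as a single break.
LINE_BREAK = re.compile(r"\r\n?|\n")


def breaks_to_br(text: str) -> str:
    """Replace every line break in text with a <br> tag."""
    return LINE_BREAK.sub("<br>", text)


print(breaks_to_br("first\r\nsecond\n\nthird"))  # first<br>second<br><br>third
```

Because \r\n is matched as a unit, mixed line-ending styles in the same input each produce exactly one <br> per break.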
Step 3: Replace double <br> tags with </p><p>
Search for:
<br>\s*<br>
Regex options: None |
Regex flavors: .NET, Java, JavaScript, PCRE, ... |
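The Step 3 search can be sketched in Python as follows, using the replacement text </p><p> named in the step heading. The input is assumed to be the output of Step 2, and the names DOUBLE_BR and mark_paragraph_breaks are hypothetical.

```python
import re

# Two <br> tags in a row, optionally separated by whitespace,
# mark the boundary between two paragraphs.
DOUBLE_BR = re.compile(r"<br>\s*<br>")


def mark_paragraph_breaks(html: str) -> str:
    """Replace each pair of adjacent <br> tags with </p><p>."""
    return DOUBLE_BR.sub("</p><p>", html)


print(mark_paragraph_breaks("one<br><br>two<br>three"))  # one</p><p>two<br>three
```

A single <br> is left untouched, so line breaks inside a paragraph survive this step.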