9.5. Convert Plain Text to HTML by Adding <p> and <br> Tags
Problem
Given a plain text string, such as a multiline value
submitted via a form, you want to convert it to an HTML fragment to
display within a web page. Paragraphs, separated by two line breaks in a
row, should be surrounded with <p>⋯</p>. Additional
line breaks should be replaced with <br> tags.
Solution
This problem can be solved in four simple steps. In most programming languages, only the middle two steps benefit from regular expressions.
Step 1: Replace HTML special characters with named character references
As we’re converting plain text to HTML, the first step
is to convert the three special HTML characters &, <, and > to named character references (see
Table 9-3).
Otherwise, a web browser would interpret any of these characters in the
text as markup instead of displaying them literally.
Table 9-3. HTML special character substitutions
| Search for | Replace with |
|---|---|
| & | &amp; |
| < | &lt; |
| > | &gt; |
Ampersands (&) must be
replaced first, since you’ll be adding more ampersands to the subject
string as part of the named character references.
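The ordering rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the recipe's own code, and the function name `escape_html_specials` is a hypothetical helper introduced here. It uses plain string replacement, in line with the note that this step does not need regular expressions.

```python
def escape_html_specials(text: str) -> str:
    """Replace &, <, and > with named character references."""
    # Ampersands go first: the later replacements introduce new
    # ampersands ("&lt;", "&gt;") that must not be escaped again.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


print(escape_html_specials("if a < b && b > c"))
# prints: if a &lt; b &amp;&amp; b &gt; c
```

If the ampersand replacement ran last instead, a `<` would first become `&lt;` and then `&amp;lt;`, which a browser displays as the literal text `&lt;`. In Python, the standard library call `html.escape(text, quote=False)` performs the same three substitutions.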
Step 2: Replace all line breaks with <br>
Search for:
\r\n?|\n
| Regex options: None |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
\R
| Regex options: None |
| Regex flavors: PCRE 7, Perl 5.10 |
Replace with:
<br>
| Replacement text flavors: .NET, Java, JavaScript, Perl, PHP, Python, Ruby |
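As a rough sketch of step 2 in Python, the first regex above can be applied with `re.sub`. The names `LINE_BREAK` and `line_breaks_to_br` are hypothetical and chosen only for this example. Python's `re` module is not among the flavors listed for `\R`, so the sketch uses the `\r\n?|\n` form.

```python
import re

# Matches a CRLF pair, a lone CR, or a lone LF as a single line break.
LINE_BREAK = re.compile(r"\r\n?|\n")


def line_breaks_to_br(text: str) -> str:
    """Replace every line break in text with a <br> tag."""
    return LINE_BREAK.sub("<br>", text)


print(line_breaks_to_br("one\r\ntwo\rthree\nfour"))
# prints: one<br>two<br>three<br>four
```

Because `\r\n?` consumes the optional `\n` together with the `\r`, a Windows-style CRLF pair produces one `<br>` rather than two.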
Step 3: Replace double <br> tags with </p><p>
Search for:
<br>\s*<br>
| Regex options: None |
| Regex flavors: .NET, Java, JavaScript, PCRE, ... |