9.5. Convert Plain Text to HTML by Adding <p> and <br> Tags

Problem

Given a plain text string, such as a multiline value submitted via a form, you want to convert it to an HTML fragment to display within a web page. Paragraphs, separated by two line breaks in a row, should be surrounded with <p></p>. Additional line breaks should be replaced with <br> tags.

Solution

This problem can be solved in four simple steps. In most programming languages, only the middle two steps benefit from regular expressions.

Step 1: Replace HTML special characters with named character references

As we’re converting plain text to HTML, the first step is to convert the three special HTML characters &, <, and > to named character references (see Table 9-3). Otherwise, the resulting markup could lead to unintended results when displayed in a web browser.

Table 9-3. HTML special character substitutions

Search for    Replace with
&             «&amp;»
<             «&lt;»
>             «&gt;»

Ampersands (&) must be replaced first, since you’ll be adding more ampersands to the subject string as part of the named character references.
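As a minimal sketch of this step in Python, the three substitutions can be done with plain string replacement, no regular expression required. The function name `escape_html` is a hypothetical helper chosen for illustration; it is not part of the recipe.

```python
def escape_html(text: str) -> str:
    """Replace the three HTML special characters with named character references."""
    # Ampersands go first: the later replacements introduce new "&"
    # characters (in "&lt;" and "&gt;") that must not be escaped again.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


if __name__ == "__main__":
    print(escape_html("a < b & c > d"))
```

If the ampersand replacement ran last instead, a `<` in the input would first become `&lt;` and then `&amp;lt;`, which a browser displays as the literal text `&lt;`. Python's standard library function `html.escape(text, quote=False)` performs the same three substitutions.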

Step 2: Replace all line breaks with <br>

Search for:

\r\n?|\n
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

\R
Regex options: None
Regex flavors: PCRE 7, Perl 5.10

Replace with:

<br>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP, Python, Ruby
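The following Python sketch applies this step with the first regex, since Python is not among the flavors listed for `\R`. The names `LINE_BREAK` and `breaks_to_br` are illustrative assumptions, not names from the recipe.

```python
import re

# Matches a Windows (\r\n), old Mac (\r), or Unix (\n) line break.
# "\r\n?" is tried first, so a CRLF pair is consumed as one line break.
LINE_BREAK = re.compile(r"\r\n?|\n")


def breaks_to_br(text: str) -> str:
    """Replace every line break in the text with a <br> tag."""
    return LINE_BREAK.sub("<br>", text)


if __name__ == "__main__":
    print(breaks_to_br("line one\r\nline two\n\nline three"))
```

Because `\r\n` is matched as a single unit, a CRLF sequence yields one `<br>` rather than two, so the number of tags matches the number of line breaks regardless of the platform the text came from.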

Step 3: Replace double <br> tags with </p><p>

Search for:

<br>\s*<br>
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, ...
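A rough Python sketch of this step follows, using the regex above and the `</p><p>` replacement named in the step heading. The names `DOUBLE_BR` and `split_paragraphs` are hypothetical, and the sketch covers only this step: it does not add the outermost opening `<p>` and closing `</p>` that the problem statement calls for.

```python
import re

# Two <br> tags in a row, optionally separated by whitespace,
# mark the boundary between two paragraphs.
DOUBLE_BR = re.compile(r"<br>\s*<br>")


def split_paragraphs(text: str) -> str:
    """Turn each double <br> into a paragraph boundary (</p><p>)."""
    return DOUBLE_BR.sub("</p><p>", text)


if __name__ == "__main__":
    print(split_paragraphs("first<br><br>second<br>still second"))
```

A single `<br>` is left untouched, so line breaks inside a paragraph survive as `<br>` tags while blank lines become paragraph boundaries.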
