9.5. Convert Plain Text to HTML by Adding <p> and <br> Tags

Problem

Given a plain text string, such as a multiline value submitted via a form, you want to convert it to an HTML fragment to display within a web page. Paragraphs, separated by two line breaks in a row, should be surrounded with <p></p>. Additional line breaks should be replaced with <br> tags.

Solution

This problem can be solved in four simple steps. In most programming languages, only the middle two steps benefit from regular expressions.

Step 1: Replace HTML special characters with named character references

As we’re converting plain text to HTML, the first step is to convert the three special HTML characters &, <, and > to named character references (see Table 9-3). Otherwise, the resulting markup could lead to unintended results when displayed in a web browser.

Table 9-3. HTML special character substitutions

Search for      Replace with
----------      ------------
&               «&amp;»
<               «&lt;»
>               «&gt;»

Ampersands (&) must be replaced first, since you’ll be adding more ampersands to the subject string as part of the named character references.
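
Since this step does not need regular expressions, it can be sketched with plain string replacement. The Python snippet below is a minimal illustration of Table 9-3, not part of the original recipe; the function name escape_html_specials is hypothetical.

```python
def escape_html_specials(text: str) -> str:
    """Replace the three HTML special characters with named character references."""
    # Ampersands go first: the later replacements add "&" characters of
    # their own, and those must not be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


print(escape_html_specials("a < b & c > d"))  # a &lt; b &amp; c &gt; d
```

If the order were reversed, a literal < would first become &lt; and the ampersand pass would then turn it into &amp;lt;. In Python specifically, the standard library call html.escape(text, quote=False) should perform the same three substitutions.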

Step 2: Replace all line breaks with <br>

Search for:

\r\n?|\n
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

\R
Regex options: None
Regex flavors: PCRE 7, Perl 5.10

Replace with:

<br>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP, Python, Ruby
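
As a rough illustration of this step, the first regex can be applied with Python's re module. This is a sketch only; the names LINE_BREAK and line_breaks_to_br are hypothetical. The \R variant is not used here because Python is not among the flavors listed for it.

```python
import re

# Matches a CRLF pair, a lone CR, or a lone LF as a single line break.
LINE_BREAK = re.compile(r"\r\n?|\n")


def line_breaks_to_br(text: str) -> str:
    """Replace every line break with a <br> tag."""
    return LINE_BREAK.sub("<br>", text)


print(line_breaks_to_br("one\r\ntwo\rthree\nfour"))  # one<br>two<br>three<br>four
```

Because \r\n? consumes a Windows-style CRLF pair in one match, each line break yields exactly one <br> tag, whichever line-ending convention the input uses.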

Step 3: Replace double <br> tags with </p><p>

Search for:

<br>\s*<br>
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, ...
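
The search pattern above, combined with the replacement text </p><p> named in the step heading, can be tried out in Python as follows. This is a minimal sketch; the names DOUBLE_BR and double_br_to_paragraph_break are hypothetical and not from the original recipe.

```python
import re

# Two <br> tags in a row, optionally separated by whitespace.
DOUBLE_BR = re.compile(r"<br>\s*<br>")


def double_br_to_paragraph_break(text: str) -> str:
    """Turn each pair of consecutive <br> tags into a paragraph boundary."""
    return DOUBLE_BR.sub("</p><p>", text)


print(double_br_to_paragraph_break("first<br><br>second<br>third"))
# first</p><p>second<br>third
```

A single <br> is left untouched, so line breaks inside a paragraph survive. The result has no opening <p> at the start and no closing </p> at the end, so it is not yet the complete fragment described in the Problem section.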
