8.3. Remove All XML-Style Tags Except <em> and <strong>

Problem

You want to remove all tags in a string except <em> and <strong>.

In a separate case, you want to remove all tags other than <em> and <strong>, and also remove any <em> and <strong> tags that contain attributes.

Solution

This is a good fit for negative lookahead (explained in Recipe 2.16). Applied to this problem, negative lookahead lets you match what looks like a tag, except when certain tag names come immediately after the opening < or </. If you then replace all matches with an empty string (Recipe 3.14 shows you how), only the approved tags are left behind.

Solution 1: Match tags except <em> and <strong>

</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
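
As a rough illustration of the replace-with-empty-string step, here is a minimal Python sketch that applies the Solution 1 regex. The names TAG_EXCEPT_EM_STRONG and strip_tags, along with the sample string, are made up for this example and are not part of the recipe.

```python
import re

# Solution 1 regex, compiled with the case-insensitive option.
# TAG_EXCEPT_EM_STRONG and strip_tags are illustrative names only.
TAG_EXCEPT_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)


def strip_tags(text: str) -> str:
    """Replace every matched tag with an empty string."""
    return TAG_EXCEPT_EM_STRONG.sub("", text)


sample = (
    '<p class="x">Keep <em>this</em> and <STRONG>that</STRONG>, '
    'drop <a href="a>b">links</a>.</p>'
)
print(strip_tags(sample))
# prints: Keep <em>this</em> and <STRONG>that</STRONG>, drop links.
```

Note that the word boundary inside the lookahead means a tag such as <embed> is still removed, while <em class="x"> is left alone by this first solution.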

In free-spacing mode:

< /?                   # Opening <, optionally followed by / (closing tags)
(?!                    # Negative lookahead
    (?: em | strong )  #     List of tags to avoid matching
    \b                 #     Word boundary avoids partial word matches
)                      # End of lookahead
[a-z]                  # Tag name initial character must be a-z
(?: [^>"']             #     Any character except >, ", or '
  | "[^"]*"            #     Double-quoted attribute value
  | '[^']*'            #     Single-quoted attribute value
)*                     # Repeat zero or more times
>                      # Closing >
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

Solution 2: Match tags except <em> and <strong>, and any tags that contain attributes

With one change (replacing the \b with \s*>), you can make the regex also match any <em> and <strong> tags that ...
