9.3. Remove All XML-Style Tags Except <em> and <strong>

Problem

You want to remove all tags in a string except <em> and <strong>.

As a variation, you want to remove all tags other than <em> and <strong>, and also remove any <em> and <strong> tags that contain attributes.

Solution

Negative lookahead (explained in Recipe 2.16) is well suited to this problem. It lets you match anything that looks like a tag, except when certain words come immediately after the opening < or </. If you then replace all matches with an empty string (following the code in Recipe 3.14), only the approved tags are left behind.

Solution 1: Match tags except <em> and <strong>

</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
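
As a rough illustration of the replace-with-empty-string step, here is a minimal Python sketch that applies the Solution 1 regex. The helper name strip_tags and the sample string are made up for this example; only the pattern itself comes from the recipe.

```python
import re

# Solution 1 pattern, copied from the recipe. re.IGNORECASE supplies the
# "case insensitive" regex option.
TAGS_EXCEPT_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)


def strip_tags(text):
    """Hypothetical helper: delete every match, leaving <em>/<strong> behind."""
    return TAGS_EXCEPT_EM_STRONG.sub("", text)


# Illustrative input: note the ">" inside the quoted href value.
sample = (
    '<p class="x">Keep <em>this</em> and <STRONG>that</STRONG>, '
    'drop <a href="a>b">links</a>.</p>'
)
print(strip_tags(sample))
# prints: Keep <em>this</em> and <STRONG>that</STRONG>, drop links.
```

Because of the \b inside the lookahead, a tag such as <emphasis> is still removed, while <em class="a"> is left alone by this first solution.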

In free-spacing mode:

< /?                   # Permit closing tags
(?!
    (?: em | strong )  # List of tags to avoid matching
    \b                 # Word boundary avoids partial word matches
)
[a-z]                  # Tag name initial character must be a-z
(?: [^>"']             # Any character except >, ", or '
  | "[^"]*"            # Double-quoted attribute value
  | '[^']*'            # Single-quoted attribute value
)*
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
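
In Python, free-spacing mode corresponds to the re.VERBOSE flag. The sketch below assumes the free-spacing regex is pasted unchanged into a raw triple-quoted string. The name TAG_PATTERN and the sample input are invented for the example.

```python
import re

# The free-spacing regex from the recipe. re.VERBOSE makes Python ignore the
# layout whitespace and the "#" comments; re.IGNORECASE adds case insensitivity.
TAG_PATTERN = re.compile(
    r"""
    < /?                   # Permit closing tags
    (?!
        (?: em | strong )  # List of tags to avoid matching
        \b                 # Word boundary avoids partial word matches
    )
    [a-z]                  # Tag name initial character must be a-z
    (?: [^>"']             # Any character except >, ", or '
      | "[^"]*"            # Double-quoted attribute value
      | '[^']*'            # Single-quoted attribute value
    )*
    >
    """,
    re.IGNORECASE | re.VERBOSE,
)

cleaned = TAG_PATTERN.sub("", "<div>a <em>b</em><br/>c</div>")
print(cleaned)
# prints: a <em>b</em>c
```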

Solution 2: Match tags except <em> and <strong>, and any tags that contain attributes

With one change (replacing the \b with \s*>), you can make the regex also match any <em> and <strong> tags that contain ...
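
The following Python sketch tries out that one change, assuming it means literally substituting \s*> for \b in the Solution 1 regex. The name STRICT_TAG_PATTERN and the sample string are invented for the example.

```python
import re

# Solution 1 with \b swapped for \s*> inside the lookahead (an assumption
# about how the change is applied). The lookahead now only protects tags whose
# name is followed by optional whitespace and the closing ">".
STRICT_TAG_PATTERN = re.compile(
    r"""</?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)

sample = '<p><em class="hot">a</em> <strong>b</strong> <em >c</em></p>'
result = STRICT_TAG_PATTERN.sub("", sample)
print(result)
# prints: a</em> <strong>b</strong> <em >c</em>
```

In this sample the opening <em class="hot"> is removed, but its closing </em> has no attributes and so stays behind. Whether that orphaned closing tag matters depends on how the result is used.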
