9.3. Remove All XML-Style Tags Except <em> and <strong>
Problem
You want to remove all tags in a string except <em> and <strong>.
In a second variation, you want to remove not only all tags other
than <em> and <strong>, but also any <em> and <strong> tags that
contain attributes.
Solution
This is a perfect place to use negative lookahead (explained in
Recipe 2.16). Applied to this problem, negative lookahead lets you
match anything that looks like a tag, except when certain words come
immediately after the opening < or </. If you then replace all matches
with an empty string (following the code in Recipe 3.14), only the
approved tags are left behind.
Solution 1: Match tags except <em> and <strong>
</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
| Regex options: Case insensitive |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
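As a rough illustration of the replace-with-empty-string step, here is a minimal Python sketch. It assumes Python's re module as the flavor; the helper name strip_tags and the sample string are hypothetical and not part of the recipe.

```python
import re

# Solution 1 regex, compiled with the case-insensitive option the recipe calls for.
TAG_EXCEPT_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)

def strip_tags(text: str) -> str:
    """Remove every tag except <em> and <strong> (hypothetical helper)."""
    # Replacing each match with an empty string leaves only the approved tags.
    return TAG_EXCEPT_EM_STRONG.sub("", text)

sample = '<p class="x">Keep <em>this</em> and <STRONG>that</STRONG>, drop <b>bold</b>.</p>'
print(strip_tags(sample))
# prints: Keep <em>this</em> and <STRONG>that</STRONG>, drop bold.
```

Note that a tag such as <embed> is still removed, because the word boundary in the lookahead prevents "em" from matching only the start of a longer tag name.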
In free-spacing mode:
< /? # Permit closing tags
(?!
(?: em | strong ) # List of tags to avoid matching
\b # Word boundary avoids partial word matches
)
[a-z] # Tag name initial character must be a-z
(?: [^>"'] # Any character except >, ", or '
| "[^"]*" # Double-quoted attribute value
| '[^']*' # Single-quoted attribute value
)*
>
| Regex options: Case insensitive, free-spacing |
| Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
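The free-spacing version can be used the same way. The following sketch assumes Python, where free-spacing mode is enabled with re.VERBOSE; the constant name and the sample string are illustrative assumptions only.

```python
import re

# Free-spacing form of Solution 1; re.VERBOSE ignores the whitespace and
# treats everything after an unescaped # (outside a character class) as a comment.
TAG_EXCEPT_EM_STRONG_VERBOSE = re.compile(
    r"""
    < /?                   # Permit closing tags
    (?!
        (?: em | strong )  # List of tags to avoid matching
        \b                 # Word boundary avoids partial word matches
    )
    [a-z]                  # Tag name initial character must be a-z
    (?: [^>"']             # Any character except >, ", or '
      | "[^"]*"            # Double-quoted attribute value
      | '[^']*'            # Single-quoted attribute value
    )*
    >
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical input: the quoted attribute value contains a > character.
sample = "<div><em>a</em><embed src='x>y'></div>"
cleaned = TAG_EXCEPT_EM_STRONG_VERBOSE.sub("", sample)
print(cleaned)
# prints: <em>a</em>
```

The quoted-value alternatives are what allow the > inside src='x>y' to be consumed as part of the tag rather than ending the match early.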
Solution 2: Match tags except <em> and <strong>, and any tags that contain attributes
With one change (replacing the ‹\b› with ‹\s*>›), you can make the regex also match any
<em> and <strong> tags that contain ...