9.3. Remove All XML-Style Tags Except <em> and <strong>
Problem
You want to remove all tags in a string except <em> and <strong>.
In a separate case, you want to remove not only all tags other
than <em> and <strong>, but also any <em> and <strong> tags
that contain attributes.
Solution
This is a perfect setting to put negative lookahead (explained in
Recipe 2.16) to use. Applied to this problem,
negative lookahead lets you match what looks like a tag,
except when certain words come immediately after
the opening < or </. If you then replace all matches with an
empty string (following the code in Recipe 3.14), only the approved tags are left
behind.
Solution 1: Match tags except <em> and <strong>
</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
| Regex options: Case insensitive |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
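As a rough illustration of the replace-with-empty-string step, the following Python sketch applies the Solution 1 regex with `re.sub`. The names `TAGS_EXCEPT_EM_STRONG`, `strip_tags`, and `sample` are made up for this example and are not part of the recipe.

```python
import re

# Solution 1 regex, compiled with the case-insensitive option the recipe calls for.
# (Hypothetical constant name, chosen for this sketch.)
TAGS_EXCEPT_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)

def strip_tags(html: str) -> str:
    """Replace every match with an empty string; <em> and <strong> never match."""
    return TAGS_EXCEPT_EM_STRONG.sub("", html)

sample = '<p class="x">Some <em>very</em> <b>bold</b> and <strong>strong</strong> text</p>'
print(strip_tags(sample))
# prints: Some <em>very</em> bold and <strong>strong</strong> text
```

Because of the word boundary in the lookahead, a tag such as `<embed>` is still removed: `em` is followed by another word character there, so the lookahead does not treat it as an `<em>` tag.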
In free-spacing mode:
< /? # Permit closing tags
(?!
(?: em | strong ) # List of tags to avoid matching
\b # Word boundary avoids partial word matches
)
[a-z] # Tag name initial character must be a-z
(?: [^>"'] # Any character except >, ", or '
| "[^"]*" # Double-quoted attribute value
| '[^']*' # Single-quoted attribute value
)*
>
| Regex options: Case insensitive, free-spacing |
| Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
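In Python, free-spacing mode corresponds to the `re.VERBOSE` flag. The sketch below is one way the free-spacing version might be written; the name `FREE_SPACING` is an assumption made for this example.

```python
import re

# Free-spacing form of the Solution 1 regex. With re.VERBOSE, whitespace
# outside character classes is ignored and # starts a comment.
FREE_SPACING = re.compile(
    r"""
    < /?                    # Permit closing tags
    (?!
        (?: em | strong )   # List of tags to avoid matching
        \b                  # Word boundary avoids partial word matches
    )
    [a-z]                   # Tag name initial character must be a-z
    (?: [^>"']              # Any character except >, ", or '
      | "[^"]*"             # Double-quoted attribute value
      | '[^']*'             # Single-quoted attribute value
    )*
    >
    """,
    re.IGNORECASE | re.VERBOSE,
)

print(FREE_SPACING.sub("", "<DIV>a <Em>b</Em></DIV>"))
# prints: a <Em>b</Em>
```

Inside a character class, whitespace and `#` keep their literal meaning even under `re.VERBOSE`, which is why the three alternatives behave the same as in the compact pattern.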
Solution 2: Match tags except <em> and <strong>, and any tags that contain attributes
With one change (replacing the ‹\b› with ‹\s*>›), you can make the regex also match any
<em> and <strong> tags that contain ...
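Applying the stated change (‹\b› replaced with ‹\s*>›) to the Solution 1 regex gives the pattern used in the Python sketch below. The pattern is assembled from that description rather than copied from a printed listing, and the names `TAGS_EXCEPT_BARE_EM_STRONG` and `strip_tags_strict` are hypothetical.

```python
import re

# Assumed Solution 2 pattern: Solution 1 with \b swapped for \s*> in the lookahead,
# so only <em> and <strong> tags without attributes are protected from matching.
TAGS_EXCEPT_BARE_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)

def strip_tags_strict(html: str) -> str:
    """Remove all tags except attribute-free <em> and <strong> tags."""
    return TAGS_EXCEPT_BARE_EM_STRONG.sub("", html)

print(strip_tags_strict('<p><em class="x">a</em> <strong>b</strong></p>'))
# prints: a</em> <strong>b</strong>
```

In this sketch the closing `</em>` survives even though its opening tag was removed, because the closing tag itself carries no attributes. Whether that matters depends on how the output is used.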