8.3. Remove All XML-Style Tags Except <em> and <strong>
Problem
You want to remove all tags in a string except <em> and <strong>.
In a separate case, you want to remove not only all tags other
than <em> and <strong>, but also any <em> and <strong> tags
that contain attributes.
Solution
This is a perfect setting to put negative lookahead (explained
in Recipe 2.16) to use. Applied to this
problem, negative lookahead lets you match what looks like a tag,
except when certain words come immediately after
the opening < or </. If you then replace all matches with
an empty string (Recipe 3.14 shows you
how), only the approved tags are left behind.
Solution 1: Match tags except <em> and <strong>
</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
| Regex options: Case insensitive |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
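As a minimal sketch of how the Solution 1 regex might be applied, the Python snippet below replaces every match with an empty string. The function name strip_tags and the sample string are hypothetical, chosen only for illustration; re.IGNORECASE stands in for the "Case insensitive" option listed above.

```python
import re

# Solution 1 pattern, copied from the recipe. re.IGNORECASE corresponds
# to the "Case insensitive" regex option.
TAG_EXCEPT_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)


def strip_tags(text: str) -> str:
    """Remove every tag matched by the pattern (hypothetical helper)."""
    return TAG_EXCEPT_EM_STRONG.sub("", text)


# Illustrative input: <embed> shares a prefix with <em>, and its
# attribute value contains a > inside quotes.
sample = (
    '<p>Keep <em>this</em> and <STRONG id="k">that</STRONG>, '
    'drop <b>bold</b><embed src="a>b"></p>'
)
print(strip_tags(sample))
# Keep <em>this</em> and <STRONG id="k">that</STRONG>, drop bold
```

The <embed> tag is removed because the word boundary in the lookahead fails after "em" when another word character follows, while <STRONG id="k"> survives because Solution 1 does not look at attributes.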
In free-spacing mode:
< /?                    # Permit closing tags
(?!                     # Negative lookahead
  (?: em | strong )     # List of tags to avoid matching
  \b                    # Word boundary avoids partial word matches
)                       #
[a-z]                   # Tag name initial character must be a-z
(?: [^>"']              # Any character except >, ", or '
  | "[^"]*"             # Double-quoted attribute value
  | '[^']*'             # Single-quoted attribute value
)*                      #
>                       #
| Regex options: Case insensitive, free-spacing |
| Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
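The free-spacing version can be tried in Python by combining re.VERBOSE with re.IGNORECASE. The sketch below is an illustration only: the names COMPACT, FREE_SPACING, and same_result are hypothetical, and the check simply confirms that both spellings of the regex strip the same tags from a given string.

```python
import re

# Compact form of Solution 1, kept here for comparison.
COMPACT = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)

# Free-spacing form. With re.VERBOSE, whitespace outside character
# classes is ignored and # starts a comment that runs to end of line.
FREE_SPACING = re.compile(
    r"""
    < /?                    # Permit closing tags
    (?!                     # Negative lookahead
      (?: em | strong )     # List of tags to avoid matching
      \b                    # Word boundary avoids partial word matches
    )
    [a-z]                   # Tag name initial character must be a-z
    (?: [^>"']              # Any character except >, ", or '
      | "[^"]*"             # Double-quoted attribute value
      | '[^']*'             # Single-quoted attribute value
    )*
    >
    """,
    re.IGNORECASE | re.VERBOSE,
)


def same_result(text: str) -> bool:
    """Return True if both forms strip exactly the same tags."""
    return COMPACT.sub("", text) == FREE_SPACING.sub("", text)


for text in ['<div id="a">x</div>', "<a title='1 > 0'>z</a>", "<STRONG>w</STRONG>"]:
    print(same_result(text))
```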
Solution 2: Match tags except <em> and <strong>, and any tags that contain attributes
With one change (replacing the ‹\b› with ‹\s*>›), you can make the regex also match
any <em> and <strong> tags that ...
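The substitution named above (‹\s*>› in place of ‹\b›) can be sketched in Python as follows. This is an assumption-laden illustration rather than the recipe's full Solution 2 listing: the pattern is derived by applying that one change to the Solution 1 regex, and the helper name strip_tags_strict and the sample string are hypothetical.

```python
import re

# Solution 1 with \b replaced by \s*> inside the lookahead, so only
# <em> and <strong> tags with nothing but optional whitespace before
# the closing > are protected from matching.
TAG_EXCEPT_BARE_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)


def strip_tags_strict(text: str) -> str:
    """Remove every tag matched by the modified pattern (hypothetical helper)."""
    return TAG_EXCEPT_BARE_EM_STRONG.sub("", text)


sample = '<p><em>ok</em> <em class="x">styled</em> <strong >fine</strong></p>'
print(strip_tags_strict(sample))
# <em>ok</em> styled</em> <strong >fine</strong>
```

In this sketch the closing </em> that belonged to the removed <em class="x"> stays behind, because a closing tag without attributes is still protected by the lookahead.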