8.3. Remove All XML-Style Tags Except <em> and <strong>
Problem
You want to remove all tags in a string except <em> and <strong>.
In a separate case, you want to remove not only all tags other than <em> and <strong>, but also any <em> and <strong> tags that contain attributes.
Solution
This is a perfect setting to put negative lookahead (explained in Recipe 2.16) to use. Applied to this problem, negative lookahead lets you match what looks like a tag, except when certain words come immediately after the opening < or </. If you then replace all matches with an empty string (Recipe 3.14 shows you how), only the approved tags are left behind.
Solution 1: Match tags except <em> and <strong>
</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
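As a minimal sketch of how this regex might be applied, the Python snippet below replaces every match with an empty string. The names TAGS_EXCEPT_EM_STRONG and strip_tags_keep_em_strong, and the sample string, are illustrative assumptions rather than part of the recipe.

```python
import re

# The Solution 1 regex, compiled with the case-insensitive option.
TAGS_EXCEPT_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)


def strip_tags_keep_em_strong(text: str) -> str:
    """Remove every tag except <em> and <strong> (hypothetical helper)."""
    return TAGS_EXCEPT_EM_STRONG.sub("", text)


sample = '<p>Keep <em>this</em> and <strong class="x">that</strong>, drop <a href="a>b">links</a>.</p>'
print(strip_tags_keep_em_strong(sample))
# prints: Keep <em>this</em> and <strong class="x">that</strong>, drop links.
```

Note that the <strong> tag with an attribute survives here, and that the > inside the quoted href value does not end the match early, because quoted attribute values are consumed as a unit.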
In free-spacing mode:
<
/?                    # Permit closing tags
(?!                   # Negative lookahead
  (?: em | strong )   #   List of tags to avoid matching
  \b                  #   Word boundary avoids partial word matches
)
[a-z]                 # Tag name initial character must be a-z
(?: [^>"']            # Any character except >, ", or '
  | "[^"]*"           # Double-quoted attribute value
  | '[^']*'           # Single-quoted attribute value
)*
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
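In Python, free-spacing mode corresponds to the re.VERBOSE flag. The following sketch compiles the commented regex that way. The names TAG_PATTERN_VERBOSE and remove_unapproved_tags are assumptions made for illustration.

```python
import re

# Free-spacing form of the Solution 1 regex; whitespace and "#" comments
# outside character classes are ignored under re.VERBOSE.
TAG_PATTERN_VERBOSE = re.compile(
    r"""
    <
    /?                    # Permit closing tags
    (?!                   # Negative lookahead
      (?: em | strong )   #   List of tags to avoid matching
      \b                  #   Word boundary avoids partial word matches
    )
    [a-z]                 # Tag name initial character must be a-z
    (?: [^>"']            # Any character except >, ", or '
      | "[^"]*"           # Double-quoted attribute value
      | '[^']*'           # Single-quoted attribute value
    )*
    >
    """,
    re.IGNORECASE | re.VERBOSE,
)


def remove_unapproved_tags(text: str) -> str:
    """Delete every match of the verbose pattern (hypothetical helper)."""
    return TAG_PATTERN_VERBOSE.sub("", text)


print(remove_unapproved_tags("<div><Em>kept</Em> <EMBED>gone</div>"))
# prints: <Em>kept</Em> gone
```

The <EMBED> tag is removed even though its name starts with "em", because the word boundary in the lookahead prevents partial-name matches.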
Solution 2: Match tags except <em> and <strong>, and any tags that contain attributes
With one change (replacing the ‹\b› with ‹\s*>›), you can make the regex also match any <em> and <strong> tags that ...
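The sketch below applies the substitution described above to the Solution 1 regex, swapping \b for \s*> inside the lookahead. The resulting pattern is derived from that description, not quoted from the recipe, and the names TAGS_EXCEPT_PLAIN_EM_STRONG and strip_tags_keep_plain_em_strong are hypothetical.

```python
import re

# Assumption: Solution 1's regex with \b replaced by \s*> in the lookahead,
# so only attribute-free <em> and <strong> tags are protected from matching.
TAGS_EXCEPT_PLAIN_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)


def strip_tags_keep_plain_em_strong(text: str) -> str:
    """Remove all tags except attribute-free <em>/<strong> (hypothetical helper)."""
    return TAGS_EXCEPT_PLAIN_EM_STRONG.sub("", text)


print(strip_tags_keep_plain_em_strong('<em>kept</em> <strong id="s">dropped tag</strong>'))
# prints: <em>kept</em> dropped tag</strong>
```

In this sketch, a closing tag such as </strong> is kept even when its opening tag was removed for carrying attributes, since each tag is matched independently.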