8.3. Remove All XML-Style Tags Except <em> and <strong>

Problem

You want to remove all tags in a string except <em> and <strong>.

In a separate case, you want to remove not only all tags other than <em> and <strong>, but also any <em> and <strong> tags that contain attributes.

Solution

This is a perfect setting to put negative lookahead (explained in Recipe 2.16) to use. Applied to this problem, negative lookahead lets you match what looks like a tag, except when certain words come immediately after the opening < or </. If you then replace all matches with an empty string (Recipe 3.14 shows you how), only the approved tags are left behind.

Solution 1: Match tags except <em> and <strong>

</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
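
As a minimal sketch of how this regex might be applied, the Python snippet below compiles it with the case-insensitive option and replaces every match with an empty string. The helper name `strip_tags` and the sample string are illustrative assumptions, not part of the recipe.

```python
import re

# Solution 1 regex from the text, compiled with the case-insensitive option.
TAG_EXCEPT_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)


def strip_tags(text: str) -> str:
    """Remove every tag except <em> and <strong> (hypothetical helper)."""
    return TAG_EXCEPT_EM_STRONG.sub("", text)


# Illustrative input: <p> and <b> are removed, <em> and <strong> survive.
sample = '<p class="x">Some <em>very</em> <strong>bold</strong> <b>text</b></p>'
cleaned = strip_tags(sample)
print(cleaned)
```

Because of the word boundary, a tag such as <embed> is still removed: the lookahead only blocks a match when the tag name is exactly em or strong.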

In free-spacing mode:

< /?                   # Permit closing tags
(?!                    # Negative lookahead
    (?: em | strong )  #     List of tags to avoid matching
    \b                 #     Word boundary avoids partial word matches
)                      #
[a-z]                  # Tag name initial character must be a-z
(?: [^>"']             #     Any character except >, ", or '
  | "[^"]*"            #     Double-quoted attribute value
  | '[^']*'            #     Single-quoted attribute value
)*                     #
>                      #

Regex options: Case insensitive, free-spacing

Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
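
The free-spacing version can be tried in Python roughly as follows, assuming `re.VERBOSE` as the free-spacing option. The variable names and the sample string are hypothetical; the pattern itself is the one shown above.

```python
import re

# Free-spacing form of the Solution 1 regex. With re.VERBOSE, whitespace
# outside character classes and everything after # on a line is ignored.
TAG_EXCEPT_EM_STRONG_VERBOSE = re.compile(
    r"""
    < /?                   # Permit closing tags
    (?!                    # Negative lookahead
        (?: em | strong )  #     List of tags to avoid matching
        \b                 #     Word boundary avoids partial word matches
    )
    [a-z]                  # Tag name initial character must be a-z
    (?: [^>"']             #     Any character except >, ", or '
      | "[^"]*"            #     Double-quoted attribute value
      | '[^']*'            #     Single-quoted attribute value
    )*
    >
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Illustrative input with a single-quoted attribute value.
verbose_sample = "<div id='a'>a <strong>b</strong> <span>c</span></div>"
verbose_cleaned = TAG_EXCEPT_EM_STRONG_VERBOSE.sub("", verbose_sample)
print(verbose_cleaned)
```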

Solution 2: Match tags except <em> and <strong>, and any tags that contain attributes

With one change (replacing the ‹\b› with ‹\s*>›), you can make the regex also match any <em> and <strong> tags that ...
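
The substitution described above can be sketched in Python as shown below. The pattern is an assumption: it is the Solution 1 regex with only ‹\b› swapped for ‹\s*>›, and the helper name `strip_tags_strict` and the sample input are hypothetical.

```python
import re

# Assumed Solution 2 pattern: Solution 1 with \b replaced by \s*> so that
# the lookahead only protects <em> and <strong> tags without attributes.
TAG_EXCEPT_PLAIN_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)


def strip_tags_strict(text: str) -> str:
    """Remove all tags except attribute-free <em> and <strong> (hypothetical helper)."""
    return TAG_EXCEPT_PLAIN_EM_STRONG.sub("", text)


# Illustrative input: the first <em> carries an attribute and is removed.
strict_sample = '<em class="x">a</em> <em>b</em> <strong >c</strong> <i>d</i>'
strict_cleaned = strip_tags_strict(strict_sample)
print(strict_cleaned)
```

In this sketch, a closing tag that was paired with a removed, attribute-bearing opening tag stays in the output, because the closing tag itself has no attributes. Trailing whitespace before the > (as in <strong >) is still accepted by the ‹\s*›.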


Regular Expressions Cookbook by
