9.1. Find XML-Style Tags
Problem
You want to match any HTML, XHTML, or XML tags in a string, in order to remove, modify, count, or otherwise deal with them.
Solution
The most appropriate solution depends on several factors, including the level of accuracy, efficiency, and tolerance for erroneous markup that is acceptable to you. Once you’ve determined the approach that works for your needs, there are any number of things you might want to do with the results. But whether you want to remove the tags, search within them, add or remove attributes, or replace them with alternative markup, the first step is to find them.
Be forewarned that this will be a long recipe, fraught with subtleties, exceptions, and variations. If you’re looking for a quick fix and are not willing to put in the effort to determine the best solution for your needs, you might want to jump to the section of this recipe that offers a decent mix of tolerance versus precaution.
Quick and dirty
This first solution is simple and more commonly used than you might expect, but it’s included here mostly for comparison and for an examination of its flaws. It may be good enough when you know exactly what type of content you’re dealing with and are not overly concerned about the consequences of incorrect handling. This regex matches a < symbol, then simply continues until the first > occurs:
<[^>]*>
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
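As a rough illustration, the quick-and-dirty regex might be applied in Python as sketched below. The helper names (find_tags, strip_tags) and the sample strings are made up for this example and are not part of the recipe. The last call shows the kind of flaw this pattern has: a literal > inside an attribute value ends the match too early.

```python
import re

# The quick-and-dirty pattern from this recipe:
# a "<", then any run of characters other than ">", then the first ">".
TAG = re.compile(r"<[^>]*>")


def find_tags(text):
    """Return every substring the pattern treats as a tag (hypothetical helper)."""
    return TAG.findall(text)


def strip_tags(text):
    """Remove everything the pattern treats as a tag (hypothetical helper)."""
    return TAG.sub("", text)


# Illustrative input, assumed for this sketch.
sample = '<p class="intro">Hello, <b>world</b>!</p>'
print(find_tags(sample))   # ['<p class="intro">', '<b>', '</b>', '</p>']
print(strip_tags(sample))  # Hello, world!

# The flaw: the > inside the quoted attribute value stops the match early,
# so only part of the tag is found.
tricky = '<img alt="a > b" src="x.png">'
print(find_tags(tricky))   # ['<img alt="a >']
```

On well-behaved input the pattern finds each tag, but the truncated match in the last call is the sort of incorrect handling you accept by choosing this approach.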
Allow > in attribute values
This next regex is again ...