9.9. Remove XML-Style Comments
Problem
You want to remove comments from an (X)HTML or XML document. For example, you want to remove development comments from a web page before it is served to web browsers, or you want to perform subsequent searches without finding any matches within comments.
Solution
Finding comments is not a difficult task, thanks to the availability of lazy quantifiers. Here is the regular expression for the job:
<!--.*?-->
| Regex options: Dot matches line breaks |
| Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
That’s pretty straightforward. However, JavaScript engines that predate ES2018, which introduced the s (dotAll) flag, have no “dot matches line breaks” option unless you use the XRegExp library. For those engines, replace the dot with an all-inclusive character class so that the regular expression matches comments spanning more than one line. The following version works with standard JavaScript:
<!--[\s\S]*?-->
| Regex options: None |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
To remove the comments, replace all matches with the empty string (i.e., nothing). Recipe 3.14 lists code to replace all matches of a regex.
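As a rough illustration of that replacement step, here is a minimal Python sketch that applies both regexes from this recipe. The function name remove_comments and the sample string are hypothetical, chosen only for this example. In Python, the "dot matches line breaks" option corresponds to re.DOTALL.

```python
import re

# First regex from the recipe; re.DOTALL is Python's
# "dot matches line breaks" option.
COMMENT_DOTALL = re.compile(r"<!--.*?-->", re.DOTALL)

# Second regex from the recipe; [\s\S] matches any character,
# so no option is needed.
COMMENT_CLASS = re.compile(r"<!--[\s\S]*?-->")


def remove_comments(markup, pattern=COMMENT_DOTALL):
    """Replace every comment match with the empty string.

    The name and signature are illustrative, not taken from the recipe.
    """
    return pattern.sub("", markup)


# Hypothetical input: one comment spanning two lines, one on a single line.
sample = "<p>one</p><!-- note\nspanning lines --><p>two</p><!--x--><p>three</p>"

print(remove_comments(sample))                 # <p>one</p><p>two</p><p>three</p>
print(remove_comments(sample, COMMENT_CLASS))  # <p>one</p><p>two</p><p>three</p>

# For contrast, a greedy quantifier runs from the first <!-- to the
# last -->, which also deletes the markup between the two comments.
print(re.sub(r"<!--.*-->", "", sample, flags=re.DOTALL))  # <p>one</p><p>three</p>
```

The last call shows why the lazy quantifier matters: with a greedy ‹.*›, the content between two separate comments is removed along with them.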
Discussion
How it works
At the beginning and end of this regular expression are the literal character sequences ‹<!--› and ‹-->›. Since none of those characters are special in regex syntax (except within character classes, where hyphens create ranges), they don’t need to be escaped. That just leaves the ‹.*?› or ‹[\s\S]*?› in the middle of the regex to examine ...