9.9. Remove XML-Style Comments


You want to remove comments from an (X)HTML or XML document. For example, you want to remove development comments from a web page before it is served to web browsers, or you want to perform subsequent searches without finding any matches within comments.


Finding comments is not a difficult task, thanks to the availability of lazy quantifiers. Here is the regular expression for the job:

<!--.*?-->

Regex options: Dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
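
As a rough illustration of why the lazy quantifier matters here, the following Python sketch compares the lazy pattern with a greedy variant. The sample string is hypothetical, and the greedy pattern is shown only for contrast; it is not part of the recipe.

```python
import re

# Hypothetical sample containing two separate comments.
html = "<p>a</p><!-- one --><p>b</p><!-- two -->"

# Lazy quantifier: stops at the first --> after each <!--.
lazy = re.findall(r"<!--.*?-->", html, flags=re.DOTALL)

# Greedy quantifier (for contrast only): runs on to the last -->.
greedy = re.findall(r"<!--.*-->", html, flags=re.DOTALL)

print(lazy)    # two matches, one per comment
print(greedy)  # a single match that swallows the markup between the comments
```

With the greedy version, the paragraph between the two comments would be treated as part of one long comment.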

That’s pretty straightforward. As usual, though, JavaScript’s lack of a “dot matches line breaks” option (unless you use the XRegExp library) means that you’ll need to replace the dot with an all-inclusive character class in order for the regular expression to match comments that span more than one line. Following is a version that works with standard JavaScript:

<!--[\s\S]*?-->

Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
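
The difference between the dot and the all-inclusive character class can be checked with a small Python sketch. It assumes a hypothetical two-line comment, and uses Python's re.DOTALL flag to stand in for the "dot matches line breaks" option.

```python
import re

# Hypothetical input: one comment that spans a line break.
text = "before<!-- line one\nline two -->after"

# Without any option, the dot does not match the line break, so no match is found.
dot_plain = re.search(r"<!--.*?-->", text)

# With "dot matches line breaks" turned on, the whole comment matches.
dot_all = re.search(r"<!--.*?-->", text, re.DOTALL)

# The character class matches any character, so no option is needed.
char_class = re.search(r"<!--[\s\S]*?-->", text)

print(dot_plain)
print(dot_all.group() == char_class.group())
```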

To remove the comments, replace all matches with the empty string (i.e., nothing). Recipe 3.14 lists code to replace all matches of a regex.
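
As a minimal sketch of that replacement step in Python (the helper name remove_comments and the sample page are assumptions made for illustration, not code from Recipe 3.14):

```python
import re

# Compile once; re.DOTALL plays the role of "dot matches line breaks".
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def remove_comments(markup: str) -> str:
    """Replace every comment match with the empty string."""
    return COMMENT_RE.sub("", markup)

# Hypothetical page with a multiline comment and a single-line comment.
page = "<div>\n<!-- TODO: remove\n     before launch -->\n<p>Hello</p><!-- x -->\n</div>"
cleaned = remove_comments(page)
print(cleaned)
```

Note that only the comments themselves are removed; any line breaks that surrounded them stay in the output.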


How it works

At the beginning and end of this regular expression are the literal character sequences <!-- and -->. Since none of those characters are special in regex syntax (except within character classes, where hyphens create ranges), they don’t need to be escaped. That just leaves the .*? or [\s\S]*? in the middle of the regex to examine ...

