Extracting or Removing HTML Tags
Problem
You want to remove HTML tags from a string, leaving just plain text.
Solution
The following oft-cited solution is simple but wrong on all but the most trivial HTML:
($plain_text = $html_text) =~ s/<[^>]*>//gs; #WRONG
A correct but slower and slightly more complicated way is to use the CPAN modules:
use HTML::Parse;
use HTML::FormatText;
$plain_text = HTML::FormatText->new->format(parse_html($html_text));
Discussion
As with almost everything else, there is more than one way to do it. Each solution attempts to strike a balance between speed and flexibility. Occasionally you may find HTML that’s simple enough that a trivial command line call will work:
% perl -pe 's/<[^>]*>//g' file
However, this breaks on files whose tags cross line boundaries, like this:
<IMG SRC = "foo.gif"
     ALT = "Flurp!">
So, you’ll see people doing this instead:
% perl -0777 -pe 's/<[^>]*>//gs' file
or its scripted equivalent:
{
    local $/;               # temporary whole-file input mode
    $html = <FILE>;
    $html =~ s/<[^>]*>//gs;
}
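To make the difference between the two modes concrete, here is a minimal sketch that runs the same substitution line by line and then on the whole input. The sample string and the variable names ($by_line, $slurped) are made up for illustration, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical sample input: one tag split across two lines.
my $html = qq{Before <IMG SRC = "foo.gif"\n     ALT = "Flurp!"> after\n};

# Line-at-a-time processing, as perl -pe does it.
# No single line contains both the < and the >, so nothing matches.
my $by_line = '';
open my $fh, '<', \$html or die "Cannot open in-memory handle: $!";
while (my $line = <$fh>) {
    $line =~ s/<[^>]*>//g;
    $by_line .= $line;
}
close $fh;

# Whole-input processing, as perl -0777 -pe does it.
# [^>] also matches a newline, so the split tag is removed.
(my $slurped = $html) =~ s/<[^>]*>//g;

print "line mode:  $by_line";
print "slurp mode: $slurped";
```

In line mode the tag survives untouched; in slurp mode only the surrounding text remains.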
But even that isn’t good enough except for simplistic HTML without any interesting bits in it. This approach fails for the following examples of valid HTML (among many others):
<IMG SRC = "foo.gif" ALT = "A > B">
<!-- <A comment> -->
<script>if (a<b && a>c)</script>
<# Just data #>
<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
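A short sketch of what the substitution actually leaves behind for two of these inputs may help. The helper name strip_tags_naive is hypothetical and exists only for this illustration; it wraps the same regex shown in the Solution.

```perl
use strict;
use warnings;

# The naive tag stripper from the Solution, wrapped in a helper.
# (strip_tags_naive is a hypothetical name used only in this sketch.)
sub strip_tags_naive {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//gs;
    return $text;
}

# The > inside the quoted attribute value ends the match too early.
my $img = strip_tags_naive('<IMG SRC = "foo.gif" ALT = "A > B">');

# The comparison operators inside the script are taken for a tag.
my $script = strip_tags_naive('<script>if (a<b && a>c)</script>');

print "img:    [$img]\n";      # prints: img:    [ B">]
print "script: [$script]\n";   # prints: script: [if (ac)]
```

In the first case part of the tag leaks into the output. In the second, script source is both mangled and emitted as if it were document text.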
If HTML comments include other tags, those solutions would also break on text like this:
<!-- This section commented out. <B>You can't see me!</B> -->
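The same kind of check, again only as an illustrative sketch, shows how the commented-out markup leaks through. The surrounding words "Visible" and "text" are added here as hypothetical context around the comment.

```perl
use strict;
use warnings;

# The comment from the example above, with hypothetical text around it.
my $html = q{Visible <!-- This section commented out. <B>You can't see me!</B> --> text};

# The match that starts at <!-- stops at the first >, which is the end of <B>.
(my $text = $html) =~ s/<[^>]*>//gs;

print "$text\n";   # prints: Visible You can't see me! --> text
```

Both the hidden sentence and the tail of the comment delimiter end up in the supposedly plain text.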