Extracting or Removing HTML Tags
Problem
You want to remove HTML tags from a string, leaving just plain text.
Solution
The following oft-cited solution is simple but wrong on all but the most trivial HTML:
($plain_text = $html_text) =~ s/<[^>]*>//gs; #WRONG
A correct but slower and slightly more complicated way is to use the CPAN modules:
use HTML::Parse;
use HTML::FormatText;
$plain_text = HTML::FormatText->new->format(parse_html($html_text));
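The documentation for HTML::Parse describes that module as kept only for backward compatibility, so newer code often builds the tree with HTML::TreeBuilder instead. The following is a minimal sketch of that variant, not a definitive implementation. It assumes both CPAN modules are installed; the sample input and the margin settings are illustrative choices, not part of the original recipe.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # from CPAN (assumed installed)
use HTML::FormatText;    # from CPAN (assumed installed)

# Illustrative input; any HTML string would do here.
my $html_text = '<p>Hello <b>world</b></p>';

# Parse the string into a tree of HTML::Element nodes.
my $tree = HTML::TreeBuilder->new_from_content($html_text);

# Render the tree as plain text; the margins are arbitrary example values.
my $formatter  = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
my $plain_text = $formatter->format($tree);

# Release the parse tree explicitly once we are done with it.
$tree->delete;

print $plain_text;
```

Because the formatter works from a real parse tree, quoted attribute values, comments, and script content do not have to be guessed at with a pattern.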
Discussion
As with almost everything else, there is more than one way to do it. Each solution attempts to strike a balance between speed and flexibility. Occasionally you may find HTML that’s simple enough that a trivial command line call will work:
% perl -pe 's/<[^>]*>//g' file
However, this will break on files whose tags cross line boundaries, like this:
<IMG SRC = "foo.gif"
ALT = "Flurp!">
So, you’ll see people doing this instead:
% perl -0777 -pe 's/<[^>]*>//gs' file
or its scripted equivalent:
{
local $/; # temporary whole-file input mode
$html = <FILE>;
$html =~ s/<[^>]*>//gs;
}
But even that isn’t good enough except for simplistic HTML without any interesting bits in it. This approach fails for the following examples of valid HTML (among many others):
<IMG SRC = "foo.gif" ALT = "A > B">
<!-- <A comment> -->
<script>if (a<b && a>c)</script>
<# Just data #>
<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
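To see the failure concretely, here is a small sketch that runs the naive substitution over two of those inputs. The helper name strip_tags_naively is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# The naive substitution from the Solution, wrapped in a hypothetical helper.
sub strip_tags_naively {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//gs;
    return $text;
}

my $img    = strip_tags_naively('<IMG SRC = "foo.gif" ALT = "A > B">');
my $script = strip_tags_naively('<script>if (a<b && a>c)</script>');

print "img:    [$img]\n";      # prints: img:    [ B">]
print "script: [$script]\n";   # prints: script: [if (ac)]
```

In the first case the pattern stops at the > inside the quoted attribute value, so the tail of the tag is left behind as text. In the second, the script body is treated as text, and the stretch from < to > inside it is deleted as if it were a tag.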
If HTML comments include other tags, those solutions would also break on text like this:
<!-- This section commented out.
<B>You can't see me!</B>
-->
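A short sketch makes the problem visible. It feeds that comment to the same naive substitution; the variable names are chosen only for this example.

```perl
use strict;
use warnings;

# The commented-out markup, held in a non-interpolating here-document.
my $html = <<'END_HTML';
<!-- This section commented out.
<B>You can't see me!</B>
-->
END_HTML

# Same naive substitution as before.
(my $text = $html) =~ s/<[^>]*>//gs;

print $text;
# prints:
# You can't see me!
# -->
```

The pattern treats everything from <!-- up to the > of <B> as a single tag, so the text that was supposed to stay hidden shows up in the output, followed by the stray end of the comment.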