Extracting or Removing HTML Tags
Problem
You want to remove HTML tags from a string, leaving just plain text.
Solution
The following oft-cited solution is simple but wrong on all but the most trivial HTML:
($plain_text = $html_text) =~ s/<[^>]*>//gs; #WRONG
A correct but slower and slightly more complicated way is to use the CPAN modules:
use HTML::Parse;
use HTML::FormatText;
$plain_text = HTML::FormatText->new->format(parse_html($html_text));
Discussion
As with almost everything else, there is more than one way to do it. Each solution attempts to strike a balance between speed and flexibility. Occasionally you may find HTML that’s simple enough that a trivial command line call will work:
% perl -pe 's/<[^>]*>//g' file
However, this will break on files whose tags cross line boundaries, like this:
<IMG SRC = "foo.gif"
ALT = "Flurp!">
So, you’ll see people doing this instead:
% perl -0777 -pe 's/<[^>]*>//gs' file
or its scripted equivalent:
{
local $/; # temporary whole-file input mode
$html = <FILE>;
$html =~ s/<[^>]*>//gs;
}
But even that isn’t good enough except for simplistic HTML without any interesting bits in it. This approach fails for the following examples of valid HTML (among many others):
<IMG SRC = "foo.gif" ALT = "A > B">
<!-- <A comment> -->
<script>if (a<b && a>c)</script>
<# Just data #>
<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
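To make the failure concrete, here is a minimal sketch that runs the naive substitution over three of the examples above. The helper name naive_strip is hypothetical, introduced only for this illustration; the regex itself is the one from the Solution.

```perl
use strict;
use warnings;

# Hypothetical helper: wraps the naive tag-stripping substitution
# from the Solution so it can be applied to several samples.
sub naive_strip {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//gs;
    return $text;
}

my @samples = (
    '<IMG SRC = "foo.gif" ALT = "A > B">',   # ">" inside an attribute value
    '<!-- <A comment> -->',                  # tag-like text inside a comment
    '<script>if (a<b && a>c)</script>',      # "<" and ">" used as operators
);

# Print each input next to what the regex leaves behind.
for my $html (@samples) {
    printf "%-38s => [%s]\n", $html, naive_strip($html);
}
```

In each case the pattern stops at the first > it meets, so the IMG tag leaves a stray B"> behind, the comment leaves its closing -->, and the script body is mangled into if (ac).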
If HTML comments include other tags, those solutions would also break on text like this:
<!-- This section commented out.
<B>You can't see me!</B>
-->
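As a rough illustration of that last case, the following sketch applies the same whole-string substitution to the commented-out block. The variable names are chosen here for the example and are not part of the recipe.

```perl
use strict;
use warnings;

# The commented-out section from the example, as a single string.
my $html = <<'END_HTML';
<!-- This section commented out.
<B>You can't see me!</B>
-->
END_HTML

# Same naive substitution as in the Solution.
(my $text = $html) =~ s/<[^>]*>//gs;

print $text;
```

The first match runs from the opening <!-- through the > of <B>, so the regex treats the start of the comment and the bold tag as one tag. The text that was meant to stay hidden survives, followed by a leftover -->.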