Extracting or Removing HTML Tags

Problem

You want to remove HTML tags from a string, leaving just plain text.

Solution

The following oft-cited solution is simple but wrong on all but the most trivial HTML:

($plain_text = $html_text) =~ s/<[^>]*>//gs;     #WRONG

A correct but slower and slightly more complicated way is to use the CPAN modules:

use HTML::Parse;
use HTML::FormatText;
$plain_text = HTML::FormatText->new->format(parse_html($html_text));
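The HTML::Parse documentation describes that module as kept only for backward compatibility and points new code at HTML::TreeBuilder instead. The following is a minimal sketch of the same idea built on HTML::TreeBuilder, assuming both it and HTML::FormatText are installed from CPAN; the sample markup and the margin settings are illustrative choices, not part of the original recipe.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical sample input; substitute your own HTML string.
my $html_text = '<html><body><p>Hello, <b>world</b>!</p></body></html>';

# Parse the markup into a tree of HTML::Element objects.
my $tree = HTML::TreeBuilder->new_from_content($html_text);

# Render the tree as plain text; the margins are arbitrary here.
my $formatter  = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
my $plain_text = $formatter->format($tree);

# Release the parse tree explicitly once we are done with it.
$tree->delete;

print $plain_text;
```

Because a real parser does the work, quoted attribute values and comments are treated as markup rather than as text to be pattern-matched.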

Discussion

As with almost everything else, there is more than one way to do it. Each solution attempts to strike a balance between speed and flexibility. Occasionally you may find HTML that’s simple enough that a trivial command line call will work:

% perl -pe 's/<[^>]*>//g' file

However, this breaks on files whose tags cross line boundaries, like this:

<IMG SRC = "foo.gif"
     ALT = "Flurp!">

So, you’ll see people doing this instead:

% perl -0777 -pe 's/<[^>]*>//gs' file

or its scripted equivalent:

{
    local $/;               # temporary whole-file input mode
    $html = <FILE>;
    $html =~ s/<[^>]*>//gs;
}
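To see why reading the whole file matters, this small sketch applies the same substitution first one line at a time (as perl -pe does) and then to the joined input. The three sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: one tag split across two lines, then some text.
my @lines = (
    qq{<IMG SRC = "foo.gif"\n},
    qq{     ALT = "Flurp!">\n},
    qq{Some text\n},
);

# Line-at-a-time: neither half of the tag contains both < and >,
# so the pattern never matches and nothing is removed.
my $per_line = join '', map { (my $copy = $_) =~ s/<[^>]*>//g; $copy } @lines;

# Whole-input: [^>] also matches newlines, so the tag is found and removed.
(my $whole = join '', @lines) =~ s/<[^>]*>//g;

print "per line:\n$per_line";
print "whole input:\n$whole";
```

The line-at-a-time result is identical to the input, while the whole-input result keeps only the newline that followed the tag and the text line.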

But even that isn’t good enough except for simplistic HTML without any interesting bits in it. This approach fails for the following examples of valid HTML (among many others):

<IMG SRC = "foo.gif" ALT = "A > B">

<!-- <A comment> -->

<script>if (a<b && a>c)</script>

<# Just data #>

<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on text like this:

<!-- This section commented out.
    <B>You can't see me!</B>
-->
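The failures are easy to reproduce. This sketch wraps the naive substitution in a helper (naive_strip is a name chosen here for illustration) and runs it over several of the examples above, printing what is left behind.

```perl
use strict;
use warnings;

# The naive approach under test: delete anything that looks like <...>.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<[^>]*>//gs;
    return $html;
}

my @samples = (
    q{<IMG SRC = "foo.gif" ALT = "A > B">},
    q{<!-- <A comment> -->},
    q{<script>if (a<b && a>c)</script>},
    qq{<!-- This section commented out.\n    <B>You can't see me!</B>\n-->},
);

for my $html (@samples) {
    my $left = naive_strip($html);
    print "input:  $html\n";
    print "output: $left\n\n";
}
```

In each case the pattern stops at the first > it meets, so part of the markup leaks out as text: the first sample leaves ` B">`, the second leaves ` -->`, the script body collapses to `if (ac)`, and the commented-out section exposes `You can't see me!` followed by a stray `-->`.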
