Perl Cookbook by Nathan Torkington, Tom Christiansen

Extracting URLs

Problem

You want to extract all URLs from an HTML file.

Solution

Use the HTML::LinkExtor module from CPAN:

use HTML::LinkExtor;

my $parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse_file($filename);
my @links = $parser->links;
foreach my $linkarray (@links) {
    my @element = @$linkarray;
    my $elt_type = shift @element;                  # element type

    # possibly test whether this is an element we're interested in
    while (@element) {
        # extract the next attribute and its value
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        # ... do something with them ...
    }
}
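
The second argument to new in the code above is a base URL, which the module uses to turn relative links into absolute ones. The following is a minimal sketch of that effect, not part of the original recipe. The base URL http://www.example.com/docs/ and the inline HTML string are assumptions made up for illustration, and the HTML is fed with parse and eof instead of parse_file so the example needs no file on disk.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical base URL, chosen only for this illustration.
my $base_url = "http://www.example.com/docs/";

my $parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse('<img src="images/big.gif">');   # relative link
$parser->eof;                                   # signal end of document

my @urls;
for my $link ($parser->links) {
    my ($tag, %attrs) = @$link;
    # With a base URL, values are URI objects; quote them to get plain strings.
    push @urls, "$attrs{src}" if $tag eq 'img';
}
print "$_\n" for @urls;
```

Without a base URL, the same link would come back unchanged as the string images/big.gif.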

Discussion

You can use HTML::LinkExtor in two ways: either call links after the document has been completely parsed to get a list of all its links, or pass a code reference as the first argument to new. In the second case, the referenced function is called on each link as the document is parsed.
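
The callback style might look like the sketch below. It assumes the callback receives the lowercase tag name followed by attribute name and value pairs, as the module's documentation describes. The variable names $record_link and @found and the inline HTML strings are illustrative choices, not taken from the recipe.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

my @found;    # collected [tag, attribute, value] triples

# Called once per link-bearing element while the document is parsed.
my $record_link = sub {
    my ($tag, %attrs) = @_;
    for my $attr (sort keys %attrs) {
        push @found, [ $tag, $attr, "$attrs{$attr}" ];
    }
};

my $parser = HTML::LinkExtor->new($record_link);
$parser->parse('<a href="http://www.perl.com/">Home page</a>');
$parser->parse('<img src="images/big.gif">');
$parser->eof;    # signal end of document

printf "%s %s=%s\n", @$_ for @found;
```

When a callback is supplied, the parser hands each link to it instead of storing it, so there is nothing left for links to return afterward.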

The links method clears the link list, so you can call it only once per parsed document. It returns a list of links. Each element of that list is an array reference with the lowercase tag name of the HTML element at the front, followed by a list of attribute name and attribute value pairs. For instance, the HTML:

<A HREF="http://www.perl.com/" >Home page</A>
<IMG SRC="images/big.gif" LOWSRC="images/big-lowres.gif">

would return a data structure like this:

[
  [ "a",   href   => "http://www.perl.com/" ],
  [ "img", src    => "images/big.gif",
           lowsrc => "images/big-lowres.gif" ]
]
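
As a rough check of that structure, the sketch below parses the same two tags from an in-memory string and flattens each entry into one line per attribute. The names $html, @lines, and @again are illustrative assumptions. It also calls links a second time to show that the first call emptied the list.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

my $html = <<'HTML';
<A HREF="http://www.perl.com/" >Home page</A>
<IMG SRC="images/big.gif" LOWSRC="images/big-lowres.gif">
HTML

my $parser = HTML::LinkExtor->new;    # no callback, no base URL
$parser->parse($html);
$parser->eof;

my @links = $parser->links;           # each entry: [ tag, attr => value, ... ]
my @lines;
for my $link (@links) {
    my ($tag, @pairs) = @$link;
    # Peel off attribute name/value pairs two at a time.
    while (my ($attr, $value) = splice(@pairs, 0, 2)) {
        push @lines, "$tag $attr $value";
    }
}
print "$_\n" for @lines;

my @again = $parser->links;           # the first call cleared the list
print "second call returned ", scalar(@again), " links\n";
```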

Here’s an example of how you would use the $elt_type and the $attr_name to print out ...
