Extracting URLs
Problem
You want to extract all URLs from an HTML file.
Solution
Use the HTML::LinkExtor module from CPAN:
use HTML::LinkExtor;
$parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse_file($filename);
@links = $parser->links;
foreach $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;      # element type

    # possibly test whether this is an element we're interested in
    while (@element) {
        # extract the next attribute and its value
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        # ... do something with them ...
    }
}

Discussion
You can use HTML::LinkExtor in two different ways: either call
links to get a list of all links in the document
once it is completely parsed, or pass a code reference as the
first argument to new. The referenced function
is then called for each link as the document is parsed.
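The callback style might look like the following sketch, which parses an inline HTML string rather than a file (the sample markup and the way the callback collects attribute values are illustrative choices, not part of the recipe above):

```perl
use HTML::LinkExtor;

# Collect links as they are found, via a callback passed as the
# first argument to new().  With no base URL supplied, the attribute
# values arrive as plain strings.
my @found;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attrs) = @_;             # tag name, then attr/value pairs
    push @found, values %attrs;         # keep every link-type value
});

$parser->parse(<<'HTML');
<a href="http://www.perl.com/">Home page</a>
<img src="images/big.gif" lowsrc="images/big-lowres.gif">
HTML
$parser->eof;

print "$_\n" for sort @found;
```

Because the callback fires during parsing, this style lets you process or discard links as you go instead of holding the whole list in memory.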
The links method clears the parser's link list, so you can
call it only once per parsed document. Called in list context, it
returns a list of array references. Each of these arrays has the
tag name at the front, followed by pairs of attribute
names and attribute values. For instance, the HTML:
<A HREF="http://www.perl.com/">Home page</A>
<IMG SRC="images/big.gif" LOWSRC="images/big-lowres.gif">
would return a data structure like this:
[
  [ a,   href   => "http://www.perl.com/" ],
  [ img, src    => "images/big.gif",
         lowsrc => "images/big-lowres.gif" ]
]

Here's an example of how you would use the
$elt_type and the $attr_name to print out ...
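As one sketch of that kind of filter, the loop below walks a hardcoded copy of the data structure shown above and prints only the links found in <IMG> tags (the tag and attribute names match the sample structure, not any particular document):

```perl
# Sample data mirroring what $parser->links would return for the
# HTML fragment above (tag names given here as plain strings).
my @links = (
    [ "a",   href   => "http://www.perl.com/" ],
    [ "img", src    => "images/big.gif",
             lowsrc => "images/big-lowres.gif" ],
);

foreach my $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;      # e.g. "a" or "img"
    next unless $elt_type eq "img";     # only interested in images

    while (@element) {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        print "$elt_type $attr_name = $attr_value\n";
    }
}
# prints:
#   img src = images/big.gif
#   img lowsrc = images/big-lowres.gif
```

Testing $elt_type before the inner while loop is the natural place to decide which elements you care about, since each array carries the tag name first.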