11.8. Extracting Links from an HTML File
Problem
You need to extract the URLs that are specified inside an HTML document.
Solution
Use the pc_link_extractor( ) function shown in Example 11-2.
Example 11-2. pc_link_extractor( )
function pc_link_extractor($s) {
    $a = array();
    // Capture group 1 is the href value; capture group 2 is the linked text.
    if (preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i',
                       $s,$matches,PREG_SET_ORDER)) {
        // With PREG_SET_ORDER, each $match holds one complete link.
        foreach($matches as $match) {
            array_push($a,array($match[1],$match[2]));
        }
    }
    return $a;
}
For example:
$links = pc_link_extractor($page);
Discussion
The pc_link_extractor( ) function returns an array. Each element
of that array is itself a two-element array. The first element is the
target of the link, and the second element is the text that is
linked. For example:
$links=<<<END
Click <a href="http://www.oreilly.com">here</a> to visit a computer book publisher.
Click <a href="http://www.sklar.com">over here</a> to visit a computer book author.
END;
$a = pc_link_extractor($links);
print_r($a);

Array
(
    [0] => Array
        (
            [0] => http://www.oreilly.com
            [1] => here
        )

    [1] => Array
        (
            [0] => http://www.sklar.com
            [1] => over here
        )

)
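The regular expression in pc_link_extractor( )
doesn't match every link. It misses links that are constructed with
JavaScript and URLs written with some hexadecimal escapes. Because
the dot in the pattern doesn't match newlines, it also misses links
whose text spans more than one line. It should still work on the
majority of reasonably well-formed HTML.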
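When the regular expression proves too brittle, one alternative is to let an HTML parser find the anchors. The following is a minimal sketch, not part of the original recipe, that assumes PHP's DOM extension is available. The function name dom_link_extractor( ) is hypothetical. It returns the same array-of-pairs shape as pc_link_extractor( ), with one difference: the second element is the plain text of the link, with any nested tags stripped.

```php
<?php
// Hypothetical helper (not from the original recipe): returns the same
// array-of-pairs shape as pc_link_extractor(), using the DOM extension.
function dom_link_extractor($html) {
    $links = array();
    // loadHTML() rejects an empty string, so handle that case up front.
    if ($html === '') {
        return $links;
    }
    $doc = new DOMDocument();
    // Collect parser warnings about malformed markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return $links;
    }
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Skip named anchors such as <a name="top"> that have no href.
        if ($anchor->hasAttribute('href')) {
            $links[] = array($anchor->getAttribute('href'),
                             $anchor->textContent);
        }
    }
    return $links;
}
```

Like the regular expression, this sketch sees only links present in the markup, so links built at runtime by JavaScript are still missed. Without an encoding declaration in the document, loadHTML( ) may also misread non-ASCII text.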
See Also
Recipe 13.8 for information on capturing text inside HTML tags;
documentation on preg_match_all( ) at
http://www.php.net/preg-match-all.