LIB_simple_spider

The spider-specific functions are found in the LIB_simple_spider library. Its functions parse the links from a web page when given a URL, archive harvested links in an array, identify the root domain of a URL, and identify links that should be excluded from the archive.
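To make these four responsibilities concrete, here is a minimal sketch in Python using only the standard library. It is not the code of LIB_simple_spider. The function names (`parse_links`, `get_domain`, `is_excluded`, `archive_links`), their parameters, and the exclusion rule (skip non-HTTP links and links containing a caller-supplied substring) are all assumptions made for illustration. For simplicity the sketch parses an HTML string that has already been downloaded instead of fetching the URL itself, and it treats the URL's host name as the "root domain".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _AnchorCollector(HTMLParser):
    """Collects the raw href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def parse_links(html, page_url):
    """Return an absolute URL for every anchor found in html.

    Relative links are resolved against page_url (hypothetical helper).
    """
    collector = _AnchorCollector()
    collector.feed(html)
    return [urljoin(page_url, href) for href in collector.hrefs]


def get_domain(url):
    """Return the lowercased host name of a URL.

    Assumption: the host name stands in for the 'root domain'.
    """
    return (urlparse(url).hostname or "").lower()


def is_excluded(link, excluded_substrings=()):
    """Return True for links that should stay out of the archive.

    Assumed rule: drop anything that is not http/https, plus any link
    containing one of the caller-supplied substrings.
    """
    if urlparse(link).scheme not in ("http", "https"):
        return True
    return any(fragment in link for fragment in excluded_substrings)


def archive_links(archive, links, excluded_substrings=()):
    """Append new, non-excluded links to the archive list and return it."""
    for link in links:
        if link not in archive and not is_excluded(link, excluded_substrings):
            archive.append(link)
    return archive


if __name__ == "__main__":
    sample_html = (
        '<a href="/index.php">Home</a>'
        '<a href="mailto:someone@example.com">Mail</a>'
        '<a href="/index.php">Home again</a>'
    )
    found = parse_links(sample_html, "http://www.schrenk.com/")
    for url in archive_links([], found):
        print("Harvested:", url)
```

Keeping the archive as a plain list mirrors the description above of an array of harvested links. The duplicate check in `archive_links` is what stops the same page from being queued twice.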

This library, as well as the other scripts featured in this chapter, is available for download at this book’s website.

Example 17-3. Running the simple spider from Example 17-1 and Example 17-2

Harvested: http://video.google.com/videoplay?docid=4221457095668033104&hl=en
Harvested: http://www.apogeonline.com/libri/88-503-2658-0/scheda
Harvested: http://www.schrenk.com/index.php
Harvested: http://www.schrenk.com/strategies.php
Harvested: http://www.schrenk.com/webbots.php
...
