LIB_simple_spider

The spider-specific functions live in the LIB_simple_spider library. Given a URL, this library provides functions that parse the links from the web page, archive harvested links in an array, identify the root domain of a URL, and identify links that should be excluded from the archive.

This library, as well as the other scripts featured in this chapter, is available for download at this book’s website.
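The four responsibilities described above can be sketched as follows. This is a minimal illustration in Python, not the code in LIB_simple_spider itself: the function names (`parse_links`, `root_domain`, `is_excluded`, `archive_links`), the `allow_offsite` parameter, and the exclusion rules are all assumptions made for the example, and the host name is used as a stand-in for the root domain.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def parse_links(html, base_url):
    """Return every link in the page, resolved against base_url."""
    parser = _LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def root_domain(url):
    """Return the lowercased host of a URL (assumed stand-in for the root domain)."""
    return urlparse(url).netloc.lower()


def is_excluded(link, seed_url, archive, allow_offsite=False):
    """Hypothetical exclusion rules: non-HTTP links, duplicates, off-site links."""
    if urlparse(link).scheme not in ("http", "https"):
        return True
    if link in archive:
        return True
    if not allow_offsite and root_domain(link) != root_domain(seed_url):
        return True
    return False


def archive_links(html, seed_url, archive):
    """Append every non-excluded link found in the page to the archive list."""
    for link in parse_links(html, seed_url):
        if not is_excluded(link, seed_url, archive):
            archive.append(link)
    return archive
```

In this sketch the archive is a plain list, mirroring the array described above. Keeping the exclusion test in its own function means the rules can change without touching the parsing or archiving code.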

Example 17-3. Running the simple spider from Example 17-1 and Example 17-2

Harvested: http://video.google.com/videoplay?docid=4221457095668033104&hl=en
Harvested: http://www.apogeonline.com/libri/88-503-2658-0/scheda
Harvested: http://www.schrenk.com/index.php
Harvested: http://www.schrenk.com/strategies.php
Harvested: http://www.schrenk.com/webbots.php
...


Webbots, Spiders, and Screen Scrapers, 2nd Edition by Michael Schrenk
