LIB_simple_spider

The spider-specific functions are in the LIB_simple_spider library. This library provides functions that parse the links from a web page at a given URL, archive the harvested links in an array, identify the root domain of a URL, and identify links that should be excluded from the archive.

This library, as well as the other scripts featured in this chapter, is available for download at this book's website.
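To make the archiving, root-domain, and exclusion ideas concrete, here is a minimal sketch in Python rather than the library's own code. The function names (get_root_domain, is_excluded, archive_links) and the exclusion rules (skip non-HTTP links, off-domain links, and duplicates) are assumptions chosen for illustration; the rules in LIB_simple_spider may differ.

```python
from urllib.parse import urlsplit


def get_root_domain(url):
    """Return the lowercased host of a URL, e.g. 'www.example.com'."""
    return (urlsplit(url).hostname or "").lower()


def is_excluded(link, seed_url, archive):
    """Return True if the link should be kept out of the archive.

    Assumed rules (hypothetical, for illustration only):
    - the link is not http or https (mailto:, javascript:, and so on)
    - the link points to a different domain than the seed URL
    - the link is already in the archive
    """
    if urlsplit(link).scheme.lower() not in ("http", "https"):
        return True
    if get_root_domain(link) != get_root_domain(seed_url):
        return True
    return link in archive


def archive_links(links, seed_url, archive):
    """Append every non-excluded link to the archive list and return it."""
    for link in links:
        if not is_excluded(link, seed_url, archive):
            archive.append(link)
    return archive
```

Called with a list of harvested links, the seed URL, and an initially empty list, archive_links() grows the list with only the new, on-domain HTTP links, so repeated calls across pages never store the same link twice.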


Figure 18-2. Running the simple spider from Listings 18-1 and 18-2

harvest_links()

The harvest_links() function downloads the specified web page and returns all the links in an array. This function, shown in Listing 18-3, uses the ...
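As a rough illustration of link harvesting in general, the Python sketch below extracts the href of every anchor tag from a page's HTML and resolves relative addresses against the page URL. It is not Listing 18-3: the name harvest_links_from_html is hypothetical, the download step is left out (the HTML is passed in as a string), and the parsing approach is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the raw href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def harvest_links_from_html(html, page_url):
    """Return all links found in html as absolute URLs, in page order."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    # Resolve relative links such as '/about' against the page's own URL.
    return [urljoin(page_url, href) for href in collector.hrefs]
```

Keeping the download separate from the parsing makes the parsing step easy to try on a saved page before pointing a spider at a live site.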
