May 2017
Beginner to intermediate
220 pages
5h 2m
English
We can now use AlexaCallback with a slightly modified version of the link crawler we developed earlier to download the top 500 Alexa URLs sequentially. To support this, the link crawler now accepts either a single start URL or a list of start URLs:
# In link_crawler function
if isinstance(start_url, list):
    crawl_queue = start_url
else:
    crawl_queue = [start_url]
We also need to update how robots.txt is handled for each site. We use a simple dictionary to store one parser per domain (see: https://github.com/kjam/wswp/blob/master/code/chp4/advanced_link_crawler.py#L53-L72). We also need to handle the fact that not every URL we encounter will be relative, and some of them aren't even URLs we can visit, such as e-mail addresses ...
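The two ideas in this paragraph can be sketched roughly as follows. This is a minimal illustration, not the code from the linked repository: the names parser_for, normalize_link and the make_parser parameter are hypothetical, and the filter assumes that only http and https links are worth visiting.

```python
from urllib import robotparser
from urllib.parse import urljoin, urlparse


def get_robots_parser(robots_url):
    """Download and parse a robots.txt file (performs a network request)."""
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp


def parser_for(url, cache, make_parser=get_robots_parser):
    """Return the robots.txt parser for url's domain, creating it on first use.

    cache is a plain dict mapping domain -> parser. make_parser is a
    hypothetical hook so the function can be exercised without network access.
    """
    parts = urlparse(url)
    domain = parts.netloc
    if domain not in cache:
        robots_url = '{}://{}/robots.txt'.format(parts.scheme, domain)
        cache[domain] = make_parser(robots_url)
    return cache[domain]


def normalize_link(page_url, link):
    """Resolve link against page_url; return None for targets we cannot crawl.

    Assumption: anything that is not http or https (for example mailto:
    links) is skipped.
    """
    absolute = urljoin(page_url, link)
    if urlparse(absolute).scheme not in ('http', 'https'):
        return None
    return absolute
```

Inside the crawl loop, one might call normalize_link on each extracted link, skip the None results, and then ask parser_for(url, cache).can_fetch(user_agent, url) before downloading, so that each domain's robots.txt is fetched only once.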