May 2017
Beginner to intermediate
220 pages
5h 2m
English
First, we need to interpret robots.txt to avoid downloading blocked URLs. Python's urllib package includes the robotparser module, which makes this straightforward, as follows:
>>> from urllib import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.webscraping.com/robots.txt')
>>> rp.read()
>>> url = 'http://example.webscraping.com'
>>> user_agent = 'BadCrawler'
>>> rp.can_fetch(user_agent, url)
False
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True
The robotparser module loads a robots.txt file and then provides a can_fetch() method, which reports whether a particular user agent is allowed to access a given URL. Here, when the user agent is set to 'BadCrawler', the robotparser ...
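The same check can be tried without network access by passing robots.txt rules straight to the parser with parse() instead of set_url() and read(). The following is a minimal sketch, not the site's actual file: the ROBOTS_TXT content and the build_parser() helper are assumptions made up for illustration, chosen so that 'BadCrawler' is blocked everywhere and every other agent is blocked only under a hypothetical /trap path.

```python
from urllib import robotparser

# Hypothetical robots.txt rules, inlined so the example runs offline.
# The real file served by the example site may differ.
ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""


def build_parser(robots_txt):
    """Return a RobotFileParser loaded from a robots.txt string (illustrative helper)."""
    rp = robotparser.RobotFileParser()
    # parse() takes an iterable of lines, so no HTTP request is made.
    rp.parse(robots_txt.splitlines())
    return rp


rp = build_parser(ROBOTS_TXT)
url = 'http://example.webscraping.com'

# 'BadCrawler' matches the first record, which disallows every path.
print(rp.can_fetch('BadCrawler', url))            # False
# Any other agent falls back to the '*' record, which only blocks /trap.
print(rp.can_fetch('GoodCrawler', url))           # True
print(rp.can_fetch('GoodCrawler', url + '/trap')) # False
```

In a crawler, a call to can_fetch() like this would typically sit just before each download, so that disallowed URLs are skipped rather than requested.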