May 2017
Beginner to intermediate
220 pages
5h 2m
English
If we crawl a website too quickly, we risk being blocked or overloading the server(s). To minimize these risks, we can throttle our crawl by waiting for a set delay between downloads. Here is a class to implement this:
from urllib.parse import urlparseimport timeclass Throttle: """Add a delay between downloads to the same domain """ def __init__(self, delay): # amount of delay between downloads for each domain self.delay = delay # timestamp of when a domain was last accessed self.domains = {} def wait(self, url): domain = urlparse(url).netloc last_accessed = self.domains.get(domain) if self.delay > 0 and last_accessed is not None: sleep_secs = self.delay - (time.time() - last_accessed) if sleep_secs > 0: ...Read now
Unlock full access