Now we will extend the sequential crawler to download web pages in parallel. Note that, if misused, a threaded crawler can request content too quickly, overloading a web server or getting your IP address blocked. To avoid this, our crawlers will support a delay flag that sets the minimum number of seconds between requests to the same domain.
The Alexa list example used in this chapter covers 1 million separate domains, so this problem does not apply here. However, when crawling many pages from a single domain in the future, you should allow a delay of at least one second between downloads.
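As a rough illustration of how such a delay flag might be enforced, here is a minimal sketch of a per-domain throttle. The class and method names (`Throttle`, `wait`) are illustrative, not taken from the chapter's code:

```python
import time
from urllib.parse import urlparse


class Throttle:
    """Sleep as needed so that successive requests to the same
    domain are at least `delay` seconds apart (illustrative sketch)."""

    def __init__(self, delay):
        self.delay = delay
        # Map each domain to the time it was last requested
        self.last_accessed = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last = self.last_accessed.get(domain)
        if self.delay > 0 and last is not None:
            # How long until the minimum delay has elapsed?
            sleep_secs = self.delay - (time.time() - last)
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.last_accessed[domain] = time.time()
```

A crawler would call `throttle.wait(url)` immediately before each download; requests to different domains are not delayed, which is why the 1-million-domain Alexa crawl is unaffected.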
How threads and processes work
Here is a diagram of a process containing multiple threads of execution:
When a Python script or other computer ...