Chapter 19. Web Scraping in Parallel

Web crawling is fast. At least, it’s usually much faster than hiring a dozen interns to copy data from the internet by hand! Of course, the progression of technology and the hedonic treadmill demand that at a certain point even this will not be “fast enough.” That’s the point at which people generally start to look toward distributed computing.

Unlike most other technology fields, web crawling often cannot be improved simply by “throwing more cycles at the problem.” Running one process is fast; running two processes is not necessarily twice as fast. And running three processes might get you banned from the remote server you’re hammering with all your requests!

However, in some situations parallel web crawling, or running parallel threads or processes, can still be of benefit (a sketch of the first scenario follows this list):

  • Collecting data from multiple sources (multiple remote servers) instead of just a single source

  • Performing long or complex operations on the collected data (such as doing image analysis or OCR) that could be done in parallel with fetching the data

  • Collecting data from a large web service where you are paying for each query, or where creating multiple connections to the service is within the bounds of your usage agreement
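As a minimal sketch of the first scenario, the following uses Python’s ThreadPoolExecutor to fetch several unrelated sites at once. The URLs here are placeholders, and the third-party requests library is assumed to be installed; because each request targets a different server, the parallelism adds no extra load to any single host:

    from concurrent.futures import ThreadPoolExecutor

    import requests

    # Placeholder list of independent sources; each URL lives on a
    # different server, so no single host sees more traffic than before.
    URLS = [
        'https://en.wikipedia.org',
        'https://www.python.org',
        'https://www.example.com',
    ]

    def fetch(url):
        # Each worker thread blocks on network I/O independently, which
        # is where threads help most in Python despite the GIL.
        response = requests.get(url, timeout=10)
        return url, len(response.content)

    # One worker per source keeps the per-server request rate unchanged.
    with ThreadPoolExecutor(max_workers=len(URLS)) as executor:
        for url, size in executor.map(fetch, URLS):
            print(f'{url}: {size} bytes')

Because the threads spend nearly all their time waiting on the network, fetching three sources this way takes roughly as long as fetching one.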

Processes Versus Threads

Threads and processes are not a Python-specific concept. While the exact implementation details differ between (and are dependent on) operating systems, the general consensus in computer science is that processes are larger than threads: every process contains at least one thread, each process has its own memory space, and threads within the same process share their parent’s memory.
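Here is a minimal sketch of that distinction in Python, using only the standard library: a thread can modify a list owned by the main program, while a process, which gets its own copy of memory, cannot.

    import multiprocessing
    import threading

    results = []

    def record(label):
        # Appends to the list defined above, in whatever memory space
        # this function happens to be running in.
        results.append(label)

    if __name__ == '__main__':
        thread = threading.Thread(target=record, args=('thread',))
        thread.start()
        thread.join()

        process = multiprocessing.Process(target=record, args=('process',))
        process.start()
        process.join()

        # Prints ['thread']: the thread shared the main program's memory,
        # while the process appended to its own private copy of `results`,
        # which vanished when the process exited.
        print(results)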
