Get full access to Python Web Scraping - Second Edition and 60K+ other titles, with a free 10-day trial of O'Reilly.

There are also live events, courses curated by job role, and more.

Multiprocessing crawler

To improve the performance further, the threaded example can be extended to support multiple processes. Currently, the crawl queue is held in local memory, which means other processes cannot contribute to the same crawl. To address this, the crawl queue will be transferred to Redis. Storing the queue independently means that even crawlers on separate servers could collaborate on the same crawl.

For more robust queuing, a dedicated distributed task tool, such as Celery, should be considered; however, Redis will be reused here to minimize the number of technologies and dependencies introduced. Here is an implementation of the new Redis-backed queue:

# Based loosely on the Redis Cookbook FIFO Queue:# http://www.rediscookbook.org/implement_a_fifo_queue.html ...

Get Python Web Scraping - Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Don’t leave empty-handed

Get Mark Richards’s Software Architecture Patterns ebook to better understand how to design components—and how they should interact.

It’s yours, free.

Get it now

Check it out now on O’Reilly

Dive in for free with a 10-day trial of the O’Reilly learning platform—then explore all the other resources our members count on to build skills and solve problems every day.

Start your free trial Become a member now