The most important takeaway of this chapter is this: when performing distributed crawling, always use suitably sized batches.
Depending on how fast your source websites respond, you may have hundreds, thousands, or tens of thousands of URLs. You want each batch to be large enough—taking a few minutes to process—so that any startup costs are sufficiently amortized. On the other hand, you don't want batches so large that a single machine failure becomes a major risk. In a fault-tolerant distributed system, failed batches are retried, and you wouldn't want a retry to mean redoing hours' worth of work.
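One way to apply this rule is to derive the batch size from an estimated per-URL fetch time and a target batch duration. The sketch below assumes a hypothetical throughput of about two seconds per URL and a five-minute target; both numbers are illustrative and should be tuned to your crawler's measured performance.

```python
import math


def make_batches(urls, est_seconds_per_url=2.0, target_batch_seconds=300.0):
    """Split a URL list into batches sized to run for a few minutes each.

    est_seconds_per_url and target_batch_seconds are illustrative
    assumptions, not universal constants: measure your own crawler's
    throughput and pick a target long enough to amortize startup costs
    but short enough that a retried batch loses only minutes of work.
    """
    batch_size = max(1, math.floor(target_batch_seconds / est_seconds_per_url))
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


# Example: 1,000 URLs at ~2 s each yield batches of 150 URLs,
# i.e. roughly five minutes of work per batch.
urls = [f"https://example.com/page/{i}" for i in range(1000)]
batches = make_batches(urls)
```

With these numbers, a crashed worker costs at most one five-minute batch on retry, while per-batch startup overhead remains a small fraction of the total runtime.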