April 2024
We’ve finally arrived at the last stage of our performance improvement journey. Here, we will broaden our horizons and learn how to distribute the training process across multiple machines or servers. So, instead of using four or eight devices, we can use dozens or hundreds of computing resources to train our models.
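Before diving into cluster details, it may help to see the core idea behind distributed data-parallel training: each worker computes gradients on its own shard of data, and an all-reduce operation averages those gradients so every worker applies the same update. The sketch below simulates that averaging step in plain Python; the worker count and gradient values are illustrative, not taken from this book.

```python
# Minimal sketch of the gradient averaging (all-reduce) that underlies
# distributed data-parallel training. Values are illustrative only.

def all_reduce_mean(worker_grads):
    """Average one gradient value across all workers, as an all-reduce would."""
    mean = sum(worker_grads) / len(worker_grads)
    # After the all-reduce, every worker holds the same averaged gradient,
    # so all model replicas stay synchronized.
    return [mean] * len(worker_grads)

# Four workers (e.g. four GPUs spread across servers) computed local gradients:
local_grads = [0.2, 0.4, 0.1, 0.3]
print(all_reduce_mean(local_grads))  # every worker ends up with 0.25
```

In a real framework, this averaging runs once per gradient tensor after each backward pass, typically over the cluster's high-bandwidth interconnect rather than in a single process.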
An environment composed of multiple connected servers is usually called a computing cluster, or simply a cluster. Such environments are shared among multiple users and have technical particularities, such as a high-bandwidth, low-latency network.
In this chapter, we’ll describe the characteristics of computing clusters that are most relevant to the distributed training process. After ...