11

Training with Multiple Machines

We’ve finally arrived at the last mile of our performance improvement journey. In this final stage, we will broaden our horizons and learn how to distribute the training process across multiple machines or servers. So, instead of using four or eight devices, we can use dozens or hundreds of computing resources to train our models.

An environment composed of multiple connected servers is usually called a computing cluster, or simply a cluster. Such environments are shared among multiple users and have technical particularities, such as a high-bandwidth, low-latency network.
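Before diving into cluster details, it may help to see what multi-machine training looks like in code. The following is a minimal sketch, not taken from the book, of a training script that PyTorch’s DistributedDataParallel (DDP) can run across several nodes; the toy model, tensor sizes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of a multi-node training script using PyTorch DDP.
# Assumes it is launched by torchrun (or a similar launcher), which sets the
# RANK, LOCAL_RANK, and WORLD_SIZE environment variables on every process.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # NCCL is the usual backend for GPU clusters; the default env:// init
    # method reads the rendezvous information set by the launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model for illustration; a real workload would build its own model.
    model = torch.nn.Linear(1024, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Each process trains on its own shard of data; DDP averages the
    # gradients across all processes (and hence all machines) in backward().
    for _ in range(10):
        inputs = torch.randn(32, 1024, device=local_rank)
        targets = torch.randint(0, 10, (32,), device=local_rank)
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In a typical setup, the same script would be started on every machine with a launcher such as torchrun, pointing all nodes at a common rendezvous endpoint so they can form a single process group over the cluster’s network.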

In this chapter, we’ll describe the characteristics of computing clusters that are most relevant to the distributed training process. After ...
