Chapter 8. Distributed Training at a Glance

When we face a complex problem in real life, we usually solve it by dividing the big problem into smaller parts that are easier to handle. By combining the partial solutions obtained from those smaller pieces, we arrive at the solution to the original problem. This strategy, known as divide and conquer, is frequently used to solve computational tasks, and we can say it is the basis of the parallel and distributed computing fields.

It turns out that this idea of dividing a big problem into small pieces comes in handy for accelerating the training process of complex models. In cases where a single resource is not enough to train the model in a reasonable time, the only way out is to break the training workload into smaller pieces and distribute them across multiple computing resources.
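To make that idea concrete, here is a minimal sketch of one common way to distribute training in PyTorch: data parallelism with DistributedDataParallel (DDP), where each process holds a full copy of the model and trains on its own shard of the data. The toy model, data, and hyperparameters below are illustrative placeholders, not code from this book.

```python
# Minimal data-parallel training sketch with PyTorch DDP.
# Launch with: torchrun --nproc_per_node=2 train.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU nodes
    rank = dist.get_rank()

    # Each process builds an identical copy of the model (a toy model here)...
    model = nn.Linear(10, 1)
    # ...and DDP keeps the copies in sync by averaging gradients.
    ddp_model = DDP(model)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    # Each process trains on its own piece of the data (random here).
    inputs = torch.randn(32, 10)
    targets = torch.randn(32, 1)

    optimizer.zero_grad()
    loss = loss_fn(ddp_model(inputs), targets)
    loss.backward()  # gradients are all-reduced across processes
    optimizer.step()

    if rank == 0:
        print(f"step done, loss = {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Because every process computes gradients on a different shard and DDP averages them during the backward pass, the effect is one big training step split across workers, which is exactly the divide-and-conquer pattern described above.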
