6. Engineering Distributed Training
In the previous chapter, we discussed how to select optimal hardware for a Deep Learning (DL) training job and how to optimize your model for the target hardware platform. In this chapter, we take an in-depth look at how to design efficient distributed training on Amazon SageMaker for your particular use case and model architecture.
There are two specific problems that distributed training aims to address. The first is how to reduce the training time of large models by distributing the training workload across multiple compute devices. The second arises when we need to train models that cannot fit into the memory of a single GPU device. This problem is especially relevant for NLP tasks, where it's shown ...
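To make the first problem concrete, the sketch below shows one way to request distributed data-parallel training through the SageMaker Python SDK, so that each GPU in the cluster processes a different shard of the data. It is a minimal illustration, not a complete recipe from this chapter: the entry point script name, instance type and count, framework versions, role, and S3 path are all placeholder assumptions you would replace with your own values.

from sagemaker.pytorch import PyTorch

# Minimal sketch: enable SageMaker's distributed data parallel library
# so gradient computation is spread across all GPUs in the cluster.
estimator = PyTorch(
    entry_point="train.py",              # hypothetical training script
    role="<your-sagemaker-execution-role>",
    instance_count=2,                    # scale out across multiple instances
    instance_type="ml.p3.16xlarge",      # 8 GPUs per instance
    framework_version="1.12",
    py_version="py38",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit({"training": "s3://<bucket>/<training-data-prefix>"})

Model parallelism, which addresses the second problem, is configured through the same distribution parameter but with a different library and additional partitioning parameters; we will return to it later in the chapter.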