Chapter 9: Scaling Your Training Jobs

In the four previous chapters, you learned how to train models with built-in algorithms, frameworks, or your own code.

In this chapter, you'll learn how to scale training jobs so that they can train on larger datasets while keeping training time and cost under control. We'll start by discussing when and how to make scaling decisions, using monitoring information and simple guidelines. You'll also see how to collect profiling information with Amazon SageMaker Debugger to understand how efficient your training jobs are. Then, we'll look at several key techniques for scaling: pipe mode, distributed training, data parallelism, and model parallelism. After that, we'll launch a large training job ...

Learn Amazon SageMaker - Second Edition by Julien Simon
