Chapter 7: Profile Training Jobs with Amazon SageMaker Debugger

Training machine learning (ML) models involves experimenting with multiple algorithms and their hyperparameters, typically while crunching through large volumes of data. Training a model that yields optimal results is both a time- and compute-intensive task. Reducing training time improves productivity and lowers overall training costs.

Distributed training, as we discussed in Chapter 6, Training and Tuning at Scale, goes a long way toward improving training times by using a scalable compute cluster. However, monitoring the training infrastructure to identify and debug resource bottlenecks is not trivial. Once a training job has been launched, the process becomes opaque, ...
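To make the chapter's premise concrete, here is a minimal sketch of how profiling can be enabled on a training job with the SageMaker Python SDK by attaching a ProfilerConfig to an estimator; the entry point, IAM role, instance settings, and S3 paths are placeholders, not values from this book.

```python
# A minimal sketch, assuming the SageMaker Python SDK (v2); the entry point,
# IAM role, and S3 locations below are placeholders.
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from sagemaker.pytorch import PyTorch

# Collect system metrics (CPU, GPU, memory, network, I/O) every 500 ms,
# plus detailed framework metrics for 10 training steps.
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(num_steps=10),
)

estimator = PyTorch(
    entry_point="train.py",                               # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder IAM role
    instance_count=2,                                     # small distributed cluster
    instance_type="ml.p3.2xlarge",
    framework_version="1.8.1",
    py_version="py36",
    profiler_config=profiler_config,
)

# Launch the training job; Debugger streams the profiling data to S3 while it runs.
estimator.fit("s3://my-bucket/training-data")             # placeholder S3 input
```

While the job runs, the collected system and framework metrics can be inspected in SageMaker Studio's Debugger views or retrieved programmatically, which is exactly the visibility into a running job that this chapter explores.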
