Chapter 7: Profile Training Jobs with Amazon SageMaker Debugger

Training machine learning (ML) models involves experimenting with multiple algorithms and their hyperparameters, typically while crunching through large volumes of data. Training a model that yields optimal results is therefore both a time- and compute-intensive task. Shorter training times improve productivity and reduce overall training costs.

Distributed training, as we discussed in Chapter 6, Training and Tuning at Scale, goes a long way toward improving training times by using a scalable compute cluster. However, monitoring the training infrastructure to identify and debug resource bottlenecks is not trivial. Once a training job has been launched, the process becomes opaque, ...
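Before we dive in, the following is a minimal sketch of what profiling a training job looks like in practice: it enables SageMaker Debugger's profiler on a training job via the SageMaker Python SDK (v2). The entry point script, IAM role, and S3 paths shown here are hypothetical placeholders, and the interval and instance settings are illustrative assumptions, not recommendations:

```python
# Minimal sketch: enable SageMaker Debugger profiling on a training job.
# Assumes SageMaker Python SDK v2; script, role, and S3 paths are hypothetical.
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from sagemaker.pytorch import PyTorch

profiler_config = ProfilerConfig(
    # Sample system metrics (CPU, GPU, memory, network, I/O) every 500 ms
    system_monitor_interval_millis=500,
    # Capture framework-level metrics with default settings
    framework_profile_params=FrameworkProfile(),
)

estimator = PyTorch(
    entry_point="train.py",  # hypothetical training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # hypothetical role
    instance_count=2,  # a small distributed cluster, for illustration
    instance_type="ml.p3.2xlarge",
    framework_version="1.12",
    py_version="py38",
    profiler_config=profiler_config,
)

estimator.fit("s3://my-bucket/training-data/")  # hypothetical data location
```

With the profiler attached, SageMaker Debugger collects system and framework metrics while the job runs, giving you visibility into the resource utilization of each instance in the cluster; the rest of this chapter looks at how to use that data to find and fix bottlenecks.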
