Operationalizing Deep Learning Training

In Chapter 1, Introducing Deep Learning with Amazon SageMaker, we discussed how SageMaker integrates with CloudWatch Logs and Metrics to give you visibility into the training process by collecting training logs and metrics. However, deep learning (DL) training jobs are prone to several kinds of issues specific to model architecture and training configuration, and specialized tools are required to monitor, detect, and react to them. Because many training jobs run for hours or even days on a large number of compute instances, the cost of errors is high.

When running DL training jobs, you need to be aware of two types of issues:

  • Issues with model and training configuration, which prevent the model ...
