Chapter 9: Training ML Models at Scale in SageMaker Studio
A typical ML life cycle starts with prototyping and then transitions to production scale, where the data gets larger, the models get more complicated, and the runtime environment gets more complex. Getting a training job done requires the right set of tools. Distributed training, which uses multiple computers to share the load, addresses situations that involve large datasets and large models. However, because complex ML training jobs use more compute resources and more costly infrastructure (such as graphics processing units (GPUs)), being able to effectively train a complex ML model on large data is important for a data scientist and an ML engineer. Being able to see and monitor how a training script ...