5

Considering Hardware for Deep Learning Training

Training a large deep learning (DL) model is typically a lengthy and data- and resource-hungry process. Considering an extreme case of the GPT-3 NLP model, it took approximately 34 days to train it from scratch using 1,024 NVIDIA A100 GPUs. While it’s unlikely that you will have to train such a large model from scratch, even fine-tuning large DL models on your custom data can take days or even weeks.

Choosing a compute instance type for your specific model is a crucial step that will impact the cost and duration of training. AWS provides a wide spectrum of compute instances for various workload profiles. In this chapter, we will consider the price-performance characteristics of the most suitable ...

Get Accelerate Deep Learning Workloads with Amazon SageMaker now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.