Chapter 8. Distributed Training
Training a machine learning model may take a long time, especially if your training dataset is huge or you train on a single machine. Even with a GPU at your disposal, it can still take weeks to train a complex model such as ResNet-50, a 50-layer computer vision model trained to classify objects into a thousand categories.
Reducing model training time requires a different approach. You have already seen some of the options available: in Chapter 5, for example, you learned to leverage datasets in a data pipeline. Another option is to use more powerful accelerators, such as GPUs and TPUs (the latter available exclusively in Google Cloud).
This chapter will cover a different way to train your model, known as distributed training. Distributed training runs a model training process in parallel on a cluster of devices, such as CPUs, GPUs, and TPUs, to speed up the training process. (In this chapter, for the sake of concision, I will refer to processors such as CPUs and hardware accelerators such as GPUs and TPUs collectively as workers or devices.) After you read this chapter, you will know how to refactor your single-node training routine for distributed training. (Every example you have seen in this book up to this point has been single node: that is, each one used a machine with one CPU to train the model.)
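To make the core idea concrete before diving into the details, here is a minimal, dependency-free sketch of data-parallel training. It is an illustration of the general pattern only, not code from this chapter: the toy dataset, the linear model, and the use of threads (standing in for independent worker processes) are all assumptions made to keep the sketch self-contained, and the gradient-averaging step plays the role that gradient aggregation plays on a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data generated from y = 3x; we fit y = w*x with squared loss.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def shard_gradient(shard, w):
    # Each worker computes the loss gradient on its own data shard:
    # d/dw (w*x - y)^2 = 2*(w*x - y)*x, averaged over the shard.
    grads = [2 * (w * x - y) * x for x, y in shard]
    return sum(grads) / len(grads)

def train(num_workers=2, steps=50, lr=0.01):
    w = 0.0  # every worker starts from the same model parameters
    # Split the dataset into one shard per worker.
    data = list(zip(xs, ys))
    shards = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for _ in range(steps):
            # Workers compute gradients on their shards in parallel...
            grads = list(pool.map(lambda s: shard_gradient(s, w), shards))
            # ...then the gradients are averaged (the role an all-reduce
            # plays on a real cluster) and the shared update is applied.
            w -= lr * sum(grads) / len(grads)
    return w

w = train()  # converges toward the true weight, 3.0
```

Because each shard sees different data, averaging the per-worker gradients recovers (here, exactly) the gradient over the full dataset, which is why every replica can apply the same update and stay in sync.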
In distributed training, your model is trained by multiple independent processes. You can think of each process as an independent training ...