Chapter 8. Distributed Training
Training a machine learning model may take a long time, especially if your training dataset is huge. This is especially common when you train on a single machine: even with a graphics processing unit (GPU) at your disposal, training a complex model such as ResNet50 can still take weeks.
Reducing model training time requires a different approach. You have already seen some of the options available: in Chapter 5, for example, you learned to leverage datasets in a data pipeline. Then there are more powerful accelerators, such as GPUs and Tensor Processing Units (TPUs, which are available exclusively in Google Cloud).
This chapter will cover a different way of training your model, known as distributed training. Distributed training runs a model training process in parallel on a cluster of devices, such as CPUs, GPUs, and TPUs, to make training faster. (In this ...
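Before diving into any particular framework, it helps to see the core idea of data-parallel distributed training in isolation: each worker computes gradients on its own shard of the data, the gradients are averaged (the "all-reduce" step), and every worker applies the same update. The following is a minimal, pure-Python sketch of that pattern using a toy linear model; the function names and the use of `multiprocessing` are illustrative assumptions, not the API of any distributed-training library.

```python
# Conceptual sketch of data-parallel training: each worker computes
# gradients on its own data shard, and the gradients are averaged
# before a single shared parameter update is applied.
from multiprocessing import Pool


def shard_gradient(args):
    """Gradient of mean squared error for y = w * x on one data shard."""
    w, shard = args
    g = 0.0
    for x, y in shard:
        g += 2 * (w * x - y) * x
    return g / len(shard)


def train(data, workers=4, lr=0.01, steps=50):
    # Split the dataset into one shard per worker (round-robin).
    shards = [data[i::workers] for i in range(workers)]
    w = 0.0  # single scalar weight for the toy model y = w * x
    with Pool(workers) as pool:
        for _ in range(steps):
            grads = pool.map(shard_gradient, [(w, s) for s in shards])
            # "All-reduce": average the per-worker gradients, then update.
            w -= lr * sum(grads) / len(grads)
    return w


if __name__ == "__main__":
    # Synthetic data generated with a true weight of 3.0.
    data = [(x, 3.0 * x) for x in range(1, 9)]
    print(round(train(data), 2))
```

Real distributed strategies follow the same shape but replace `multiprocessing.Pool` with communication across GPUs, TPUs, or machines, and perform the gradient averaging with efficient collective operations rather than a Python loop.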