Chapter 8. Distributed Training

Training a machine learning model can take a long time, especially if your training dataset is huge, and this is particularly true when you do all the training on a single machine. Even with a graphics processing unit (GPU) at your disposal, it can still take weeks to train a complex model such as ResNet50.

Reducing model training time requires a different approach. You have already seen some of the options available: in Chapter 5, for example, you learned to leverage datasets in a data pipeline. Another option is more powerful accelerators, such as GPUs and tensor processing units (TPUs), the latter of which are available exclusively in Google Cloud.

This chapter covers a different way of training your model, known as distributed training. Distributed training runs the model training process in parallel across a cluster of devices, such as CPUs, GPUs, and TPUs, to make training faster.
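As a quick preview of the idea (not a full treatment, which the rest of the chapter provides), the sketch below shows synchronous data-parallel training with TensorFlow's `tf.distribute.MirroredStrategy`, which replicates the model on every GPU visible to one machine. The layer sizes and the random toy data are placeholders chosen only to make the example runnable.

```python
import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model on all GPUs on this machine
# (falling back to the CPU if none are found) and keeps the replicas
# in sync after every batch.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Variables created inside strategy.scope() are mirrored across devices.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Toy data just to make the example self-contained; in practice you would
# feed a tf.data.Dataset built as described in Chapter 5.
x = np.random.random((256, 10)).astype("float32")
y = np.random.random((256, 1)).astype("float32")

model.fit(x, y, batch_size=32, epochs=2)
```

Notice that the training loop itself is unchanged: the strategy object decides where the variables live and how gradients are aggregated, so `model.fit` works the same way it does on a single device.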
