Chapter 8. Distributed Training

Training a machine learning model can take a long time, especially when your training dataset is huge or you are doing all of the training on a single machine. Even if you have a GPU at your disposal, it can still take weeks to train a complex model such as ResNet50, a 50-layer convolutional network for computer vision, trained to classify objects into a thousand categories.

Reducing model training time requires a different approach. You have already seen some of the available options: in Chapter 5, for example, you learned to leverage datasets in a data pipeline. Another option is to use more powerful accelerators, such as GPUs and TPUs (the latter are available exclusively in Google Cloud).

This chapter covers a different way to train your model, known as distributed training. Distributed training runs the model training process in parallel on a cluster of devices, such as CPUs, GPUs, and TPUs, to speed it up. (In this chapter, for the sake of concision, I will refer to processing units such as CPUs, GPUs, and TPUs as workers or devices.) After you read this chapter, you will know how to refactor your single-node training routine for distributed training. (Every example you have seen in this book up to this point has been single node: that is, each has used a machine with one CPU to train the model.)
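To make the idea of parallel training concrete, here is a minimal sketch in plain Python, with no TensorFlow involved. It assumes one common form of distributed training, synchronous data parallelism, in which each worker computes a gradient on its own shard of the data and the gradients are averaged before every weight update. The toy model (`y = w * x`), the helper names `shard`, `local_gradient`, and `train`, and the hyperparameters are all hypothetical and chosen only for illustration. The workers are simulated in a loop rather than run on separate devices.

```python
# Simulated synchronous data-parallel training in pure Python.
# Assumption: a toy model y = w * x trained with mean squared error.
# The "workers" run one after another here; on a real cluster they
# would run in parallel on separate devices.

def shard(data, num_workers):
    """Split the dataset into num_workers interleaved shards."""
    return [data[i::num_workers] for i in range(num_workers)]

def local_gradient(w, examples):
    """Gradient of the mean squared error with respect to w on one shard."""
    return sum(2 * (w * x - y) * x for x, y in examples) / len(examples)

def train(data, num_workers, steps=200, learning_rate=0.01):
    """Train w with gradient descent, averaging gradients across workers."""
    w = 0.0
    shards = shard(data, num_workers)
    for _ in range(steps):
        # Each worker computes a gradient on its own shard of the data.
        grads = [local_gradient(w, s) for s in shards]
        # Aggregation step: average the gradients, then update the
        # single shared weight so every worker stays in sync.
        w -= learning_rate * sum(grads) / len(grads)
    return w

# Toy dataset generated from y = 3 * x, so the ideal weight is 3.0.
data = [(float(x), 3.0 * x) for x in range(1, 9)]

w_single = train(data, num_workers=1)  # single-node baseline
w_multi = train(data, num_workers=4)   # four simulated workers
```

Because the shards here are all the same size, the averaged gradient matches the full-dataset gradient, so both runs arrive at essentially the same weight. The simulation shows only the arithmetic; the speedup in a real cluster comes from the workers computing their gradients at the same time.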

In distributed training, your model is trained by multiple independent processes. You can think of each process as an independent training ...
