Chapter 7. Distributed Training with Ray Train

In Chapter 6 we discussed how to train copies of a simple model on shards of data using Ray Datasets—but there’s much more to distributed training than that. As we indicated in Chapter 1, Ray has a dedicated library for distributed training called Ray Train. It comes with an extensive suite of machine learning training integrations and allows you to scale your experiments seamlessly on Ray Clusters.

We will start this chapter by showing you why you might need to scale your ML training and then introduce you to the different ways of doing so. After that, we’ll introduce Ray Train and walk through an extensive end-to-end example. We’ll also cover some key concepts you need to know to use Ray Train, such as preprocessors, trainers, and checkpoints. Finally, we’ll cover some of the more advanced functionality that Ray Train provides. As always, you can use the notebook for this chapter to follow along.

The Basics of Distributed Model Training

Machine learning often requires a lot of heavy computation. Depending on the type of model that you’re training, whether it be a gradient boosted tree or a neural network, you may face some common problems with training ML models:

  • The time it takes to finish training is too long.

  • The data is too large to fit into one machine.

  • The model itself is too large to fit into a single machine.

For the first case, training can be accelerated by processing data with increased ...

Get Learning Ray now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.