Chapter 1. An Overview of Ray
One of the reasons we need efficient distributed computing is that we’re collecting ever more data with great variety at increasing speeds. The storage systems, data processing, and analytics engines that have emerged in the past decade are crucial to the success of many companies. Interestingly, most “big data” technologies are built for and operated by (data) engineers who are in charge of data collection and processing tasks. The rationale is to free up data scientists to do what they’re best at. As a data science practitioner, you might want to focus on training complex machine learning models, running efficient hyperparameter selection, building entirely new and custom models or simulations, or serving your models to showcase them.
At the same time, scaling these workloads to a compute cluster may be inevitable. To do that, the distributed system of your choice needs to support all of these fine-grained “big compute” tasks, potentially on specialized hardware. Ideally, it also fits into the big data tool chain you’re using and is fast enough to meet your latency requirements. In other words, distributed computing has to be powerful and flexible enough for complex data science workloads—and Ray can help you with that.
Python is likely the most popular language for data science today; it’s certainly the one we find the most useful for our daily work. Python is now more than 30 years old, but it still has a growing and active community. The rich PyData ecosystem is an essential part of a data scientist’s toolbox. How can you make sure to scale out your workloads while still leveraging the tools you need? That’s a difficult problem, especially since communities can’t be forced to just toss their toolbox or programming language. That means distributed computing tools for data science have to be built for their existing community.
What Is Ray?
What we like about Ray is that it checks all these boxes. It’s a flexible distributed computing framework built for the Python data science community.
Ray is easy to get started and keeps simple things simple. Its core API is as lean as it gets and helps you reason effectively about the distributed programs you want to write. You can efficiently parallelize Python programs on your laptop and run the code you tested locally on a cluster practically without any changes. Its high-level libraries are easy to configure and can seamlessly be used together. Some of them, like Ray’s reinforcement learning library, would likely have a bright future as standalone projects, distributed or not. While Ray’s core is built in C++, it’s been a Python-first framework since day one,1 integrates with many important data science tools, and can count on a growing ecosystem.
Distributed Python is not new, and Ray is not the first framework in this space (nor will it be the last), but it is special in what it has to offer. Ray is particularly strong when you combine several of its modules and have custom, machine learning–heavy workloads that would be difficult to implement otherwise. It makes distributed computing easy enough to run your complex workloads flexibly by leveraging the Python tools you know and want to use. In other words, by learning Ray you get to know flexible distributed Python for machine learning. And showing you how is what this book is all about.
In this chapter you’ll get a first glimpse of what Ray can do for you. We will discuss the three layers that make up Ray: its core engine, high-level libraries, and ecosystem. Throughout the chapter we’ll first show you code examples to give you a feel for Ray. You can view this chapter as a quick preview of the book; we defer any in-depth treatment of Ray’s APIs and components to later chapters.
What Led to Ray?
Programming distributed systems is hard. It requires specific knowledge and experience you might not have. Ideally, such systems get out of your way and provide abstractions to let you focus on your job. But in practice, as Joel Spolsky notes, “all nontrivial abstractions, to some degree, are leaky,” and getting clusters of computers to do what you want is undoubtedly difficult. Many software systems require resources that far exceed what single servers can do. Even if one server were enough, modern systems need to be failsafe and provide features like high availability. That means your applications might have to run on multiple machines, or even datacenters, just to make sure they’re running reliably.
Even if you’re not too familiar with machine learning (ML) or artificial intelligence (AI) more generally, you must have heard of recent breakthroughs in the field. To name just two, systems like DeepMind’s AlphaFold for solving the protein folding problem and OpenAI’s Codex for helping software developers with the tedium of their jobs have made the news lately. You might also have heard that ML systems generally require large amounts of data for training, and that ML models tend to get larger. OpenAI has shown exponential growth in the compute needed to train AI models in their paper “AI and Compute”. The number of operations needed for AI systems in their study is measured in petaflops (thousands of trillions of operations per second) and has been doubling every 3.4 months since 2012.
Compare this to Moore’s law,2 which states that the number of transistors in computers would double every two years. Even if you’re bullish on Moore’s law, you can see how there’s a clear need for distributed computing in ML. You should also understand that many tasks in ML can be naturally decomposed to run in parallel. So, why not speed things up if you can?3
Distributed computing is generally perceived as hard. But why is that? Shouldn’t it be realistic to find good abstractions to run your code on clusters without having to constantly think about individual machines and how they interoperate? What if we specifically focused on AI workloads?
Researchers at RISELab at UC Berkeley created Ray to address these questions. They were looking for efficient ways to speed up their workloads by distributing them. The workloads they had in mind were quite flexible in nature and didn’t fit into the frameworks available at the time. RISELab also wanted to build a system that took care of how the work was distributed. With reasonable default behaviors in place, researchers should be able to focus on their work, regardless of the specifics of their compute cluster. And ideally they should have access to all their favorite tools in Python. For this reason, Ray was built with an emphasis on high-performance and heterogeneous workloads.4 To understand these points better, let’s have a closer look at Ray’s design philosophy.
Ray’s Design Principles
Ray is built with several design principles in mind. Its API is designed for simplicity and generality, and its compute model aims for flexibility. Its system architecture is designed for performance and scalability. Let’s look at each of these in more detail.
Simplicity and abstraction
Ray’s API not only banks on simplicity, it’s also intuitive to pick up (as you’ll see in Chapter 2). It doesn’t matter whether you want to use all the CPU cores on your laptop or leverage all the machines in your cluster. You might have to change a line of code or two, but the Ray code you use stays essentially the same. And as with any good distributed system, Ray manages task distribution and coordination under the hood. That’s great, because you’re not bogged down by reasoning about the mechanics of distributed computing. A good abstraction layer allows you to focus on your work, and we think Ray has done a great job of giving you one.
Since Ray’s API is so generally applicable and pythonic, it’s easy to integrate with other tools. For instance, Ray actors can call into or be called by existing distributed Python workloads. In that sense, Ray makes for good “glue code” for distributed workloads, too, as it’s performant and flexible enough to communicate between different systems and frameworks.
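To give you a quick taste of that simplicity (Chapter 2 introduces the API properly), here is a minimal sketch of a Ray program: you decorate a plain Python function, launch it as parallel tasks, and collect the results. The same code runs on a laptop or on a cluster.

import ray

ray.init()  # Starts a local Ray Cluster; on a real cluster you would pass an address instead.


@ray.remote
def square(x):
    # An ordinary Python function, turned into a distributed task by the decorator.
    return x * x


# Launch ten tasks in parallel and block until all results are available.
futures = [square.remote(i) for i in range(10)]
print(ray.get(futures))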
Flexibility and heterogeneity
For AI workloads, in particular when dealing with paradigms like reinforcement learning, you need a flexible programming model. Ray’s API is designed to make it easy to write flexible and composable code. Simply put, if you can express your workload in Python, you can distribute it with Ray. Of course, you still need to make sure you have enough resources available and be mindful of what you want to distribute. But Ray doesn’t limit what you can do with it.
Ray is also flexible when it comes to heterogeneity of computations. For instance, let’s say you work on a complex simulation. Simulations can usually be decomposed into several tasks or steps. Some of these steps might take hours to run, others just a few milliseconds, but they always need to be scheduled and executed quickly. Sometimes a single task in a simulation can take a long time, but other, smaller tasks should be able to run in parallel without blocking it. Also, subsequent tasks may depend on the outcome of an upstream task, so you need a framework to allow for dynamic execution that deals well with task dependencies. Ray gives you full flexibility when running heterogeneous workflows like that.
You also need to ensure you are flexible in your resource usage, and Ray supports heterogeneous hardware. For instance, some tasks might have to run on a GPU, while others run best on a couple of CPU cores. Ray provides you with that flexibility.
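As a small sketch of how this looks in code (the function names here are made up for illustration), Ray lets you declare per-task resource requirements directly in the decorator, and the scheduler then places each task on a node that has those resources free:

import ray

ray.init()


@ray.remote(num_cpus=2)
def preprocess(batch):
    # A CPU-bound task: Ray only schedules it where two CPU cores are available.
    return [item * 2 for item in batch]


@ray.remote(num_gpus=1)
def train_on_gpu(batch):
    # A GPU-bound task: Ray places it on a node with a free GPU and sets
    # CUDA_VISIBLE_DEVICES for the worker process accordingly.
    return sum(batch)

Calling train_on_gpu.remote(...) on a machine without a GPU would simply wait until a GPU becomes available, which is the scheduling behavior you want in a heterogeneous cluster.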
Speed and scalability
Another of Ray’s design principles is the speed at which Ray executes tasks: it can handle millions of tasks per second, and it is built to do so with latencies of just milliseconds.
For a distributed system to be fast, it also needs to scale well. Ray is efficient at distributing and scheduling your tasks across your compute cluster. And it does so in a fault-tolerant way, too. As you’ll learn in detail in Chapter 9, Ray Clusters support autoscaling for highly elastic workloads. Ray’s autoscaler tries to launch or stop machines in your cluster to match the current demand. This helps both to minimize costs and to ensure that your cluster has enough resources to run your workload.
In distributed systems, it’s not a question of if, but when, things will go wrong. A machine might have an outage, abort a task, or simply go up in flames.5 In any case, Ray is built to recover quickly from failures, which contributes to its overall speed.
As we haven’t talked about Ray’s architecture (Chapter 2 will introduce you to it), we can’t tell you how these design principles are realized just yet. Let’s instead shift our attention to what Ray can do for you in practice.
Three Layers: Core, Libraries, and Ecosystem
Now that you know why Ray was built and what its creators had in mind, let’s look at the three layers of Ray. This presentation is not the only way to slice it, but it’s the way that makes most sense for this book:
- A low-level, distributed computing framework for Python with a concise core API and tooling for cluster deployment called Ray Core.6
- A set of high-level libraries built and maintained by the creators of Ray. This includes Ray AIR, which lets you use these libraries with a unified API in common machine learning workloads.
- A growing ecosystem of integrations and partnerships with other notable projects that span many aspects of the first two layers.
There’s a lot to unpack here, and we’ll look into each of these layers individually in the remainder of this chapter.
You can imagine Ray’s core engine with its API at the center of things, on which everything else builds. Ray’s data science libraries build on top of Ray Core and provide a domain-specific abstraction layer.7 In practice, many data scientists will use these libraries directly, while ML or platform engineers might rely heavily on building their tools as extensions of the Ray Core API. Ray AIR can be seen as an umbrella that links Ray libraries and offers a consistent framework for dealing with common AI workloads. And the growing number of third-party integrations for Ray is another great entry point for experienced practitioners. Let’s look into each one of the layers one by one.
A Distributed Computing Framework
At its core, Ray is a distributed computing framework. We’ll provide you with just the basic terminology here and talk about Ray’s architecture in depth in Chapter 2. In short, Ray sets up and manages clusters of computers so that you can run distributed tasks on them. A Ray Cluster consists of nodes that are connected to each other via a network. You program against the so-called driver, the program root, which lives on the head node. The driver can run jobs, that is, collections of tasks that are run on the nodes in the cluster. Specifically, the individual tasks of a job are run on worker processes on worker nodes. Figure 1-1 illustrates the basic structure of a Ray Cluster. Note that we’re not concerned with communication between nodes just yet; this diagram merely shows the layout of a Ray Cluster.
What’s interesting is that a Ray Cluster can also be a local cluster, a cluster consisting of just your own computer. In this case, there’s just one node, namely, the head node, which has the driver process and some worker processes. The default number of worker processes is the number of CPUs available on your machine.
With that knowledge at hand, it’s time to get your hands dirty and run your first local Ray Cluster.
Installing Ray on any of the major operating systems should work seamlessly using pip:

pip install "ray[rllib, serve, tune]==2.2.0"

With a simple pip install ray, you will install just the basics of Ray. Since we want to explore some advanced features, we installed the “extras” rllib, serve, and tune, which we’ll discuss in a bit.8 Depending on your system configuration, you may not need the quotation marks in this installation command.
Next, go ahead and start a Python session. You could, for instance, use the ipython interpreter, which is often suitable for following simple examples. In your Python session you can now easily import and initialize Ray:

import ray
ray.init()
Note
If you don’t feel like typing in the commands yourself, you can also jump into the Jupyter notebook for this chapter and run the code there. The choice is up to you, but in any case please remember to use Python version 3.7 or later.9
With those two lines of code, you’ve started a Ray Cluster on your local machine. This cluster can utilize all the cores available on your computer as workers. Right now your Ray Cluster doesn’t do much, but that’s about to change.
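If you want to see this for yourself, here is a quick sketch: ray.cluster_resources() reports what your freshly started local cluster has available (the exact numbers depend on your machine):

import ray

# Safe to call even if you already initialized Ray in this session.
ray.init(ignore_reinit_error=True)

# Reports, among other things, the number of CPUs your local cluster can use for workers.
print(ray.cluster_resources())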
The init function you use to start the cluster is one of the six fundamental API calls that you will learn about in depth in Chapter 2. Overall, the Ray Core API is very accessible and easy to use. But since it is also a rather low-level interface, it takes time to build interesting examples with it. Chapter 2 has an extensive first example to get you started with the Ray Core API, and in Chapter 3 you’ll see how to build a more interesting Ray application for reinforcement learning.

In the preceding code you didn’t provide any arguments to the ray.init(...) function. If you wanted to run Ray on a “real” cluster, you’d have to pass more arguments to init. This init call is often called the Ray Client, and it is used to interactively connect to an existing Ray Cluster.10 You can read more about using the Ray Client to connect to your production clusters in the Ray documentation.
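As a minimal sketch of what connecting to a remote cluster can look like (the address below is a placeholder you would replace with your own head node’s address; 10001 is the default Ray Client port):

import ray

# Connect to an existing Ray Cluster via the Ray Client instead of starting a local one.
ray.init(address="ray://<head-node-ip>:10001")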
Of course, if you’ve ever worked with compute clusters, you know there are many pitfalls and intricacies. For instance, you can deploy Ray applications on clusters hosted by cloud providers such as Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure—and each choice needs good tooling for deployment and maintenance. You can also spin up a cluster on your own hardware or use tools such as Kubernetes to deploy your Ray Clusters. In Chapter 9 (following chapters with concrete Ray applications), we’ll come back to the topic of scaling workloads with Ray Clusters.
Before moving on to Ray’s higher-level libraries, let’s briefly summarize the two foundational components of Ray as a distributed computation framework:
- Ray Clusters: This component is in charge of allocating resources, creating nodes, and ensuring they are healthy. A good way to get started with Ray Clusters is its dedicated quick start guide.
- Ray Core: Once your cluster is up and running, you use the Ray Core API to program against it. You can get started with Ray Core by following the official walk-through for this component.
A Suite of Data Science Libraries
Moving on to the second layer of Ray, in this section we’ll briefly introduce all the data science libraries that Ray comes with. To do so, let’s first take a bird’s-eye view of what it means to do data science. Once you understand this context, it’s much easier to review Ray’s higher-level libraries and see how they can be useful to you.
Ray AIR and the Data Science Workflow
The somewhat elusive term “data science” (DS) has evolved quite a bit in recent years, and you can find many definitions of varying usefulness online.11 To us, it’s the practice of gaining insights and building real-world applications by leveraging data. That’s quite a broad definition of an inherently practical and applied field that centers around building and understanding things. In that sense, describing practitioners of this field as “data scientists” is about as bad a misnomer as describing hackers as “computer scientists.”12
In broad strokes, doing data science is an iterative process that entails requirements engineering, data collection and processing, building models and evaluating them, and deploying solutions. Machine learning is not necessarily part of this process but often is. If ML is involved, you can further specify some steps:
- Data processing: To train ML models, you need data in a format that your ML model understands. The process of transforming and selecting what data should be fed into your model is often called feature engineering. This step can be messy. You’ll benefit a lot if you can rely on common tools to do the job.
- Model training: In ML you need to train your algorithms on data that got processed in the previous step. This includes selecting the right algorithm for the job, and it helps if you can choose from a wide variety.
- Hyperparameter tuning: Machine learning models have parameters that are tuned in the model training step. Most ML models also have another set of parameters called hyperparameters that can be modified prior to training. These parameters can heavily influence the performance of your resulting ML model and need to be tuned properly. There are good tools to help automate that process.
- Model serving: Trained models need to be deployed. To serve a model means to make it available to whomever needs access by whatever means necessary. In prototypes, you often use simple HTTP servers, but there are many specialized software packages for ML model serving.
This list is by no means exhaustive, and there’s a lot more to be said about building ML applications.13 However, it is true that these four steps are crucial for the success of a data science project using ML.
Ray has dedicated libraries for each of the four ML-specific steps we just listed. Specifically, you can take care of your data processing needs with Ray Datasets, run distributed model training with Ray Train, run your reinforcement learning workloads with Ray RLlib, tune your hyperparameters efficiently with Ray Tune, and serve your models with Ray Serve. And the way Ray is built, all these libraries are distributed by design, a point we can’t stress enough.
What’s more, all of these steps are part of a process and are rarely tackled in isolation. Not only do you want all the libraries involved to seamlessly interoperate, it can also be a decisive advantage if you can work with a consistent API throughout the whole data science process. This is exactly what Ray AIR was built for: having a common runtime and API for your experiments and the ability to scale your workloads when you’re ready. Figure 1-2 shows a quick overview of all the components of AIR.
While introducing the Ray AI Runtime API would be too much for this chapter (you can jump ahead to Chapter 10 for that), we’ll introduce you to all the building blocks that feed into it. Let’s go through each of Ray’s DS libraries one by one.
Data Processing with Ray Datasets
The first high-level library of Ray we’ll talk about is Ray Datasets.
This library contains a data structure aptly called Dataset
, a multitude of connectors
for loading data from various formats and systems, an API for transforming such datasets,
a way to build data processing pipelines with them, and many integrations with other data processing frameworks.
The Dataset
abstraction builds on the powerful Arrow framework.14
To use Ray Datasets, you need to install Arrow for Python, for instance by running
pip install pyarrow
.
The following simple example creates a distributed Dataset
on your local Ray Cluster from a Python data structure. Specifically, you’ll create a dataset from a Python dictionary containing a string name
and an integer-valued data
for 10,000 entries:
import ray

# Create a Dataset by using from_items from the ray.data module.
items = [{"name": str(i), "data": i} for i in range(10000)]
ds = ray.data.from_items(items)

# Print the first five items of the Dataset.
ds.show(5)
To show a Dataset means to print some of its values. You should see precisely five elements on your command line, like this:

{'name': '0', 'data': 0}
{'name': '1', 'data': 1}
{'name': '2', 'data': 2}
{'name': '3', 'data': 3}
{'name': '4', 'data': 4}
Great, now you have some rows, but what can you do with that data? The Dataset API bets heavily on functional programming, as this paradigm is well suited for data transformations. Even though Python 3 made a point of hiding some of its functional programming capabilities, you’re probably familiar with functionality such as map, filter, flat_map, and others. If not, it’s easy enough to pick up: map takes each element of your dataset and transforms it into something else, in parallel; filter removes data points according to a Boolean filter function; and the slightly more elaborate flat_map first maps values similarly to map, but then it also “flattens” the result. For instance, if map produced a list of lists, flat_map would flatten out the nested lists and give you just a list. Equipped with these three functional API calls,15 let’s see how easily you can transform your dataset ds:
# We map each row of ds to only keep the square value of its data entry.
squares = ds.map(lambda x: x["data"] ** 2)

# Then we filter the squares to keep only even numbers (a total of five thousand elements).
evens = squares.filter(lambda x: x % 2 == 0)
evens.count()

# We then use flat_map to augment the remaining values with their respective cubes.
cubes = evens.flat_map(lambda x: [x, x ** 3])

# To take a total of 10 values means to leave Ray and return a Python list
# with these values that we can print.
sample = cubes.take(10)
print(sample)
The drawback of Dataset transformations is that each step gets executed synchronously. In this example that is a nonissue, but for complex tasks that, for example, mix reading files and processing data, you would want an execution that can overlap individual tasks. DatasetPipeline does exactly that. Let’s rewrite the previous example into a pipeline:

# You can turn a Dataset into a pipeline by calling .window() on it.
pipe = ds.window()

# Pipeline steps can be chained to yield the same result as before.
result = pipe\
    .map(lambda x: x["data"] ** 2)\
    .filter(lambda x: x % 2 == 0)\
    .flat_map(lambda x: [x, x ** 3])
result.show(10)
There’s a lot more to be said about Ray Datasets, especially its integration with notable data processing systems, but we’ll defer an in-depth discussion until Chapter 6.
Model Training
Moving on to the next set of libraries, let’s look at the distributed training capabilities of Ray. For that, you have access to two libraries. One is dedicated to reinforcement learning specifically; the other one has a different scope and is aimed primarily at supervised learning tasks.
Reinforcement learning with Ray RLlib
Let’s start with Ray RLlib for reinforcement learning (RL). This library is powered by the modern ML frameworks TensorFlow and PyTorch, and you can choose which one to use. Both frameworks seem to converge more and more conceptually, so you can pick the one you like most without losing much in the process. Throughout the book we use both TensorFlow and PyTorch examples so you can get a feel for both frameworks when using Ray.
For this section, go ahead and install TensorFlow with pip install tensorflow right now.16 To run the code example, you also need to install the gym library with pip install "gym==0.25.0".

One of the easiest ways to run examples with RLlib is to use the command-line tool rllib, which we already installed implicitly when we ran pip install "ray[rllib]". Once you run more complex examples in Chapter 4, you will mostly rely on its Python API, but for now we want to get a first taste of running RL experiments with RLlib.
We’ll look at a fairly classic control problem of balancing a pole on a cart. Imagine you have a pole like the one in Figure 1-3, fixed at a joint of a cart, and subject to gravity. The cart is free to move along a frictionless track, and you can manipulate the cart by giving it a push from the left or the right with a fixed force. If you do this well enough, the pole will remain in an upright position. For each time step the pole didn’t fall over, we get a reward of 1. Collecting a high reward is our goal, and the question is whether we can teach a reinforcement learning algorithm to do this for us.
Specifically, we want to train a reinforcement learning agent that can carry out two actions, namely, push to the left or to the right, observe what happens when interacting with the environment in that way, and learn from the experience by maximizing the reward.
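To make the setup concrete before we bring in RLlib, here is a minimal sketch using the gym library you just installed: an agent that pushes the cart randomly. A trained RL agent would replace the random choice with a learned policy.

import gym

# The CartPole-v1 environment simulates the cart and pole described above.
env = gym.make("CartPole-v1")

observation = env.reset()
episode_reward = 0
done = False
while not done:
    # Push left (0) or right (1) at random; an RL agent would instead pick the
    # action based on the current observation.
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    episode_reward += reward  # +1 for every step the pole stays upright

print(episode_reward)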
To tackle this problem with Ray RLlib, we can use a so-called tuned example, which is a preconfigured algorithm that runs well for a given problem. You can run a tuned example with a single command. RLlib comes with many such examples, and you can list them all with rllib example list.

One of the available examples is cartpole-ppo, a tuned example that uses the PPO algorithm to solve the cart–pole problem, specifically, the CartPole-v1 environment from OpenAI Gym. You can take a look at the configuration of this example by typing rllib example get cartpole-ppo, which will first download the example file from GitHub and then print its configuration. This configuration is encoded in YAML format and reads as follows:
cartpole-ppo:
    # The CartPole-v1 environment simulates the problem we just described.
    env: CartPole-v1
    # Use a powerful RL algorithm called Proximal Policy Optimization, or PPO.
    run: PPO
    # Once we reach a reward of 150, stop the experiment.
    stop:
        episode_reward_mean: 150
        timesteps_total: 100000
    # PPO needs some RL-specific configuration to make it work for this problem.
    config:
        framework: tf
        gamma: 0.99
        lr: 0.0003
        num_workers: 1
        observation_filter: MeanStdFilter
        num_sgd_iter: 6
        vf_loss_coeff: 0.01
        model:
            fcnet_hiddens: [32]
            fcnet_activation: linear
            vf_share_layers: true
        enable_connectors: True
The details of this configuration file don’t matter much at this point, so don’t get distracted by them. The important part is that you specify the CartPole-v1 environment and sufficient RL-specific configuration to ensure the training procedure works. Running this configuration doesn’t require any special hardware and finishes in a matter of minutes.
To train this example, you’ll have to install the PyGame dependency with pip install pygame and then simply run:

rllib example run cartpole-ppo
If you run this, RLlib creates a named experiment and logs important metrics such as the reward or the episode_reward_mean for you. In the output of the training run, you should also see information about the machine (loc, meaning hostname and port), as well as the status of your training runs. If your run is TERMINATED but you’ve never seen a successfully RUNNING experiment in the log, something must have gone wrong. Here’s a sample snippet of a training run:

+-----------------------------+----------+----------------+
| Trial name                  | status   | loc            |
|-----------------------------+----------+----------------|
| PPO_CartPole-v0_9931e_00000 | RUNNING  | 127.0.0.1:8683 |
+-----------------------------+----------+----------------+
When the training run finishes and things went well, you should see the following output:
Your training finished.
Best available checkpoint for each trial:
  <checkpoint-path>/checkpoint_<number>
You can now evaluate your trained algorithm from any checkpoint, for example, by running:
rllib evaluate <checkpoint-path>/checkpoint_<number> --algo PPO
Your local Ray checkpoint folder is ~/ray_results by default. For the training configuration we used, your <checkpoint-path> should be of the form ~/ray_results/cartpole-ppo/PPO_CartPole-v1_<experiment_id>. During the training procedure, your intermediate and final model checkpoints get generated into this folder.
To evaluate the performance of your trained RL algorithm, you can now evaluate it from checkpoint by copying the command the previous example training run printed:
rllib evaluate <checkpoint-path>/checkpoint_<number> --algo PPO

Running this command will print evaluation results, namely, the rewards achieved by your trained RL algorithm on the CartPole-v1 environment.
There’s much more that you can do with RLlib, and we’ll cover more of it in Chapter 4. The point of this example was to show you how easily you can get started with RLlib and the rllib command-line tool, just by leveraging the example and evaluate commands.
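If you’d rather stay in Python than use the CLI, the same experiment can be sketched with RLlib’s Python API. The following is a rough equivalent of the tuned example, assuming Ray 2.2 and its algorithm configuration API; Chapter 4 covers this properly:

from ray.rllib.algorithms.ppo import PPOConfig

# Configure PPO for CartPole-v1, loosely mirroring the tuned YAML example above.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("tf")
    .rollouts(num_rollout_workers=1)
)
algo = config.build()

# Train until the mean episode reward reaches 150, as in the YAML stop condition.
while True:
    results = algo.train()
    if results["episode_reward_mean"] >= 150:
        break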
Distributed training with Ray Train
Ray RLlib is dedicated to reinforcement learning, but what do you do if you need to train models for other types of machine learning, like supervised learning? You can use another Ray library for distributed training in this case: Ray Train. At this point, we don’t have enough knowledge of frameworks such as TensorFlow to give you a concise and informative example for Ray Train. If you’re interested in distributed training, you can jump ahead to Chapter 7.
Hyperparameter Tuning
Naming things is hard, but Ray Tune, which you can use to tune all sorts of parameters, hits the spot. It was built specifically to find good hyperparameters for machine learning models. The typical setup is as follows:
- You want to run an extremely computationally expensive training function. In ML, it’s not uncommon to run training procedures that take days, if not weeks, but let’s say you’re dealing with just a couple of minutes.
- As a result of training, you compute a so-called objective function. Usually you want to either maximize your gains or minimize your losses in terms of performance of your experiment.
- The tricky bit is that your training function might depend on certain parameters, called hyperparameters, that influence the value of your objective function.
- You may have a hunch what individual hyperparameters should be, but tuning them all can be difficult. Even if you can restrict these parameters to a sensible range, it’s usually prohibitive to test a wide range of combinations. Your training function is simply too expensive.
What can you do to efficiently sample hyperparameters and get “good enough” results on your objective? The field concerned with solving this problem is called hyperparameter optimization (HPO), and Ray Tune has an enormous suite of algorithms for tackling it. Let’s look at an example of Ray Tune used for the situation we just explained. The focus is yet again on Ray and its API, not on a specific ML task (which we simply simulate for now):
from ray import tune
import math
import time


# Simulate an expensive training function that depends on two hyperparameters,
# x and y, read from a config.
def training_function(config):
    x, y = config["x"], config["y"]
    # Sleep for 10 seconds to simulate training, compute the objective,
    # and report the resulting score to tune.
    time.sleep(10)
    score = objective(x, y)
    tune.report(score=score)


# The objective computes the mean of the squares of x and y and returns the
# square root of this term. This type of objective is fairly common in ML.
def objective(x, y):
    return math.sqrt((x ** 2 + y ** 2) / 2)


# Use tune.run to initialize hyperparameter optimization on our training_function.
# A key part is to provide a parameter space for x and y for tune to search over.
result = tune.run(
    training_function,
    config={
        "x": tune.grid_search([-1, -0.5, 0, 0.5, 1]),
        "y": tune.grid_search([-1, -0.5, 0, 0.5, 1]),
    })

print(result.get_best_config(metric="score", mode="min"))
Notice how the output of this run is structurally similar to what you saw in the RLlib example. That’s no coincidence, as RLlib (like many other Ray libraries) uses Ray Tune under the hood. If you look closely, you will see PENDING runs that wait for execution, as well as RUNNING and TERMINATED runs. Tune takes care of selecting, scheduling, and executing your training runs automatically.

Specifically, this Tune example finds the best possible choices of parameters x and y for a training_function with a given objective we want to minimize. Even though the objective function might look a little intimidating at first, since we compute the sum of squares of x and y, all values will be non-negative. That means the smallest value is obtained at x=0 and y=0, which evaluates the objective function to 0.

We do a so-called grid search over all possible parameter combinations. As we explicitly pass in 5 possible values for both x and y, that’s a total of 25 combinations that get fed into the training function. Since we instruct training_function to sleep for 10 seconds, testing all combinations of hyperparameters sequentially would take more than 4 minutes total. Since Ray is smart about parallelizing this workload, this whole experiment took only about 35 seconds for us, but it might take much longer, depending on where you run it.
Now, imagine each training run would have taken several hours, and we’d have 20 instead of 2 hyperparameters. That makes grid search infeasible, especially if you don’t have educated guesses on the parameter range. In such situations you’ll have to use more elaborate HPO methods from Ray Tune, as discussed in Chapter 5.
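For instance, instead of exhaustively enumerating a grid, you can sample hyperparameters from continuous ranges and cap the number of trials. Here is a minimal sketch of that idea, reusing the training_function defined in the previous example (the number of samples is just an illustration):

from ray import tune

# Assumes training_function from the earlier example is defined in this session.
result = tune.run(
    training_function,
    config={
        # Sample x and y uniformly from a range instead of fixing a grid of values.
        "x": tune.uniform(-1, 1),
        "y": tune.uniform(-1, 1),
    },
    # Run only 20 randomly sampled trials, no matter how large the search space is.
    num_samples=20,
)

print(result.get_best_config(metric="score", mode="min"))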
Model Serving
The last of Ray’s high-level libraries we’ll discuss specializes in model serving and is simply called Ray Serve. To see an example of it in action, you need a trained ML model to serve. Luckily, nowadays, you can find many interesting models on the internet that have already been trained for you. For instance, Hugging Face has a variety of models available for you to download directly in Python. The model we’ll use is a language model called GPT-2 that takes text as input and produces text to continue or complete the input. For example, you can prompt a question and GPT-2 will try to complete it.
Serving such a model is a good way to make it accessible. You may not know how to load and run a TensorFlow model on your computer, but you do know how to ask a question in plain English. Model serving hides the implementation details of a solution and lets users focus on providing inputs and understanding outputs of a model.
To proceed, make sure to run pip install transformers to install the Hugging Face library that has the model we want to use.17 With that we can now import and start an instance of Ray’s serve library, load and deploy a GPT-2 model, and ask it for the meaning of life, like so:
from ray import serve
from transformers import pipeline
import requests

# Start serve locally.
serve.start()


# The @serve.deployment decorator turns a function with a request parameter
# into a serve deployment.
@serve.deployment
def model(request):
    # Loading language_model inside the model function for every request is
    # inefficient, but it's the quickest way to show you a deployment.
    language_model = pipeline("text-generation", model="gpt2")
    query = request.query_params["query"]
    # Ask the model to give us at most 100 characters to continue our query.
    return language_model(query, max_length=100)


# Formally deploy the model so that it can start receiving requests over HTTP.
model.deploy()

query = "What's the meaning of life?"
# Use the indispensable requests library to get a response for any question you might have.
response = requests.get(f"http://localhost:8000/model?query={query}")
print(response.text)
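Loading the model on every request is wasteful. A common improvement, sketched below, is a class-based deployment that loads GPT-2 once in its constructor and reuses it for every request (the class name Model is just an illustration, and this continues the session above in which serve.start() was already called):

from ray import serve
from transformers import pipeline


@serve.deployment
class Model:
    def __init__(self):
        # Load the language model once, when the deployment starts,
        # instead of once per request.
        self.language_model = pipeline("text-generation", model="gpt2")

    def __call__(self, request):
        query = request.query_params["query"]
        return self.language_model(query, max_length=100)


Model.deploy()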
In Chapter 9 you will learn how to properly deploy models in various scenarios, but for now we encourage you to play around with this example and test different queries. Running the last two lines of code repeatedly will give you different answers practically every time. Here’s a darkly poetic gem, raising more questions, from one query that we’ve slightly censored for underaged readers:
[{ "generated_text": "What's the meaning of life?\n\n Is there one way or another of living?\n\n How does it feel to be trapped in a relationship?\n\n How can it be changed before it's too late? What did we call it in our time?\n\n Where do we fit within this world and what are we going to live for?\n\n My life as a person has been shaped by the love I've received from others." }]
This concludes our whirlwind tour of Ray’s data science libraries, the second of Ray’s layers. Ultimately, all high-level Ray libraries presented in this chapter are extensions of the Ray Core API. Ray makes it relatively easy to build new extensions, and there are a few more that we can’t discuss in full in this book. For instance, there is the relatively recent addition of Ray Workflows, which allows you to define and run long-running applications with Ray.
Before we wrap up this chapter, let’s have a very brief look at the third layer, the growing ecosystem around Ray.
A Growing Ecosystem
Ray’s high-level libraries are powerful and deserve a much deeper treatment throughout the book. While their usefulness for the data science experimentation lifecycle is undeniable, we also don’t want to give the impression that Ray is all you need from now on. No surprise, the best and most successful frameworks are the ones that integrate well with existing solutions and ideas. It’s better to focus on your core strengths and leverage other tools for what’s missing in your solution, and Ray does this quite well.
Throughout the book, and in Chapter 11 in particular, we will discuss many useful third-party libraries built on top of Ray. The Ray ecosystem also has a lot of integrations with existing tools. To give you an example of that, recall that Ray Datasets is Ray’s data loading and compute library. If you happen to have an existing project that already uses data processing engines like Spark or Dask,18 you can use those tools together with Ray. Specifically, you can run the entire Dask ecosystem on top of a Ray Cluster using the Dask-on-Ray scheduler, or you can use the Spark on Ray project to integrate your Spark workloads with Ray. Likewise, the Modin project is a distributed drop-in replacement for Pandas DataFrames that uses Ray (or Dask) as a distributed execution engine (“Pandas on Ray”).
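As a small sketch of what such an integration looks like in practice (assuming you have dask and pandas installed alongside Ray), a Dask workload can be routed to a Ray Cluster with a single scheduler switch:

import pandas as pd
import dask.dataframe as dd
import ray
from ray.util.dask import enable_dask_on_ray

ray.init()
enable_dask_on_ray()  # Route Dask computations to the Ray scheduler.

# An ordinary Dask dataframe workload, now executed by Ray workers.
df = dd.from_pandas(pd.DataFrame({"value": range(1000)}), npartitions=4)
print(df.value.mean().compute())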
The common theme here is that Ray doesn’t try to replace all these tools, but rather integrates with them while still giving you access to its native Ray Datasets library. We’ll go into much more detail about the relationship of Ray with other tools in the broader ecosystem in Chapter 11.
One important aspect of many Ray libraries is that they seamlessly integrate common tools as backends. Ray often creates common interfaces, instead of trying to create new standards.19 These interfaces allow you to run tasks in a distributed fashion, a property most of the respective backends don’t have, or not to the same extent. For instance, Ray RLlib and Train are backed by the full power of TensorFlow and PyTorch. And Ray Tune supports algorithms from practically every notable HPO tool available, including Hyperopt, Optuna, Nevergrad, Ax, SigOpt, and many others. None of these tools is distributed by default, but Tune unifies them in a common interface for distributed workloads.
Summary
Figure 1-4 gives you an overview of the three layers of Ray as we laid them out. Ray’s core distributed execution engine sits at the center of the framework. The Ray Core API is a versatile library for distributed computing, and Ray Clusters allow you to deploy your workloads in a variety of ways.
For practical data science workflows you can use Ray Datasets for data processing, Ray RLlib for reinforcement learning, Ray Train for distributed model training, Ray Tune for hyperparameter tuning, and Ray Serve for model serving. You’ve seen examples for each of these libraries and have an idea of what their APIs entail. Ray AIR provides a unified API for all other Ray ML libraries and was built with the needs of data scientists in mind.
On top of that, Ray’s ecosystem has many extensions, integrations, and backends that we’ll look more into later. Maybe you can already spot a few tools you know and like in Figure 1-4?
The Ray Core API sits at the center of Figure 1-4, surrounded by the libraries RLlib, Ray Tune, Ray Train, Ray Serve, Ray Datasets, and the many third-party integrations that are too many to list here.
1 By “Python-first” we mean that all higher-level libraries are written in Python and that the development of new features is driven by the needs of the Python community. Having said this, Ray has been designed to support multiple language bindings and, for example, comes with a Java API. So, it’s not out of the question that Ray might support other languages that are important to the data science ecosystem.
2 Moore’s law held for a long time, but there might be signs that it’s slowing down. Some even say it’s dead. We’re not here to argue these points. What’s important is not that our computers generally keep getting faster, but the relation to the amount of compute we need.
3 There are many ways to speed up ML training, from basic to sophisticated. For instance, we’ll spend a considerable amount of time elaborating on distributed data processing in Chapter 6 and distributed model training in Chapter 7.
4 Anyscale, the company behind Ray, is building a managed Ray platform and offers hosted solutions for your Ray applications.
5 This might sound drastic, but it’s not a joke. To name just one example, in March 2021 a French datacenter powering millions of websites burned down completely. If your whole cluster burns down, we’re afraid Ray can’t help you.
6 This is a Python book, so we’ll exclusively focus on Python, but you should know that Ray also has a Java API, which is less mature than its Python equivalent at this point.
7 One of the reasons so many libraries are built on top of Ray Core is that it’s so lean and straightforward to reason about. One of the goals of this book is to inspire you to write your own applications, or even libraries, with Ray.
8 We generally introduce dependencies in this book only when we need them, which should make it easier to follow along. In contrast, the notebooks on GitHub give you the option to install all dependencies up front so that you can focus on running the code instead.
9 At the time of this writing, there’s no Python 3.10 support for Ray, so sticking to a version between 3.7 and 3.9 should work best to follow this book.
10 There are other means of interacting with Ray Clusters, such as the Ray Jobs CLI.
11 We never liked the categorization of data science as an intersection of disciplines, like math, coding, and business. Ultimately, that doesn’t tell you what practitioners do.
12 As a fun exercise, we recommend reading Paul Graham’s famous “Hackers and Painters” essay on this topic and replace “computer science” with “data science.” What would hacking 2.0 be?
13 If you want to understand more about the holistic view of the data science process when building ML applications, Building Machine Learning Powered Applications by Emmanuel Ameisen (O’Reilly) is entirely dedicated to it.
14 In Chapter 6 we will introduce you to the fundamentals of what makes Ray Datasets work, including its use of Arrow. For now, we want to focus on its API and concrete usage patterns.
15 We’ll elaborate more on this in later chapters, specifically in Chapter 6, but note that Ray Datasets is not meant as a general-purpose data processing library. Tools such as Spark have more mature and optimized support for large-scale data processing.
16 If you’re on a Mac, you’ll have to install tensorflow-macos. In general, if you encounter any issues installing Ray or its dependencies on your system, please refer to the installation guide.
17 Depending on the operating system you’re using, you may need to install the Rust compiler first to make this work. For instance, on a Mac, you can install it with brew install rust.
18 Spark was created by another lab in Berkeley, AMPLab. The internet is full of blog posts claiming that Ray should therefore be seen as a replacement of Spark. It’s better to think of them as tools with different strengths that are both likely here to stay.
19 Before the deep learning framework Keras became an official part of TensorFlow, it started out as a convenient API specification for various lower-level frameworks such as Theano or CNTK. In that sense, Ray RLlib has the chance to become “Keras for RL,” and Ray Tune might just be “Keras for HPO.” The missing piece for more adoption might just be a more elegant API for both.