Chapter 4. Reinforcement Learning with Ray RLlib
In Chapter 3 you built an RL environment, a simulation to play out some games, an RL algorithm, and the code to parallelize the training of the algorithm—all completely from scratch. It’s good to know how to do all that, but in practice the only thing you really want to do when training RL algorithms is the first part, namely, specifying your custom environment, the “game” you want to play.1 Most of your efforts will go into selecting the right algorithm, setting it up, finding the best parameters for the problem, and generally focusing on training a well-performing policy.
Ray RLlib is an industry-grade library for building RL algorithms at scale. You’ve already seen a first example of RLlib in Chapter 1, but in this chapter we’ll go into much more depth. The great thing about RLlib is that it’s a mature library for developers that comes with good abstractions to work with. As you will see, many of these abstractions you already know from the previous chapter.
We start out by giving you an overview of RLlib’s capabilities. Then we quickly revisit the maze game from Chapter 3 and show you how to tackle it both with the RLlib CLI and the RLlib Python API in a few lines of code. You’ll see how easy RLlib is to get started before learning about its key concepts, such as RLlib environments and algorithms.
We’ll also take a closer look at some advanced RL topics that are extremely useful in practice but are not often properly supported in other RL libraries. For instance, you will learn how to create a curriculum for your RL agents so that they can learn simple scenarios before moving on to more complex ones. You will also see how RLlib deals with having multiple agents in a single environment and how to leverage experience data that you’ve collected outside your current application to improve your agent’s performance.
An Overview of RLlib
Before we dive into any examples, let’s quickly discuss what RLlib is and what it can do. As part of the Ray ecosystem, RLlib inherits all the performance and scalability benefits of Ray. In particular, RLlib is distributed by default, so you can scale your RL training to as many nodes as you want.
Another benefit of being built on top of Ray is that RLlib integrates tightly with other Ray libraries. For instance, the hyperparameters of any RLlib algorithm can be tuned with Ray Tune, as we will see in Chapter 5. You can also seamlessly deploy your RLlib models with Ray Serve.2
What’s extremely useful is that RLlib works with both of the predominant deep learning frameworks at the time of this writing: PyTorch and TensorFlow. You can use either one of them as your backend and can easily switch between them, often by changing just one line of code. That’s a huge benefit, as companies are often locked into their underlying deep learning framework and can’t afford to switch to another system and rewrite their code.
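For example, with the algorithm configurations you'll work with later in this chapter, switching backends typically comes down to a single call on the config object. Here's a minimal sketch (using the DQN configuration as an example; the specific algorithm doesn't matter):

from ray.rllib.algorithms.dqn import DQNConfig

# Use the PyTorch backend instead of the default TensorFlow one.
config = DQNConfig().framework("torch")

# Switching back to TensorFlow 2 is just as easy:
# config = DQNConfig().framework("tf2")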
RLlib also has a track record of solving real-world problems and is a mature library used by many companies to bring their RL workloads to production. The RLlib API appeals to many engineers, as it offers the right level of abstraction for many applications while still being flexible enough to be extended.
Apart from these more general benefits, RLlib has a lot of RL-specific features that we will cover in this chapter. In fact, RLlib is so feature rich that it would deserve a book on its own, which means we can touch on just some aspects of it here. For instance, RLlib has a rich library of advanced RL algorithms to choose from. In this chapter we will focus on a few select ones, but you can track the growing list of options on the RLlib algorithms page. RLlib also has many options for specifying RL environments and is very flexible in handling them during training; for an overview of RLlib environments see the documentation.
Getting Started with RLlib
To use RLlib, make sure you have installed it on your computer:
pip install "ray[rllib]==2.2.0"
Note
Check out the accompanying notebook for this chapter if you don’t feel like typing the code yourself.
Every RL problem starts with having an interesting environment to investigate. In Chapter 1 we looked at the classical cart–pole balancing problem. Recall that we didn’t implement this cart–pole environment; it came out of the box with RLlib.
In contrast, in Chapter 3 we implemented a simple maze game on our own.
The problem with this implementation is that we can’t directly use it with RLlib
or any other RL library for that matter.
The reason is that in RL you have ubiquitous standards, and our environments need
to implement certain interfaces.
The best known and most widely used library for RL environments is gym, an open source Python project from OpenAI.
Let’s have a look at what Gym is and how to convert our maze Environment
from the previous chapter to a Gym environment compatible with RLlib.
Building a Gym Environment
If you look at the well-documented and easy-to-read gym.Env
environment
interface on GitHub,
you’ll notice that an implementation of this interface has two mandatory class
variables and three methods that subclasses need to implement.
You don’t have to check the source code, but we do encourage you to have a look.
You might just be surprised by how much you already know about these environments.
In short, the interface of a Gym environment looks like the following pseudocode:
import gym


class Env:

    action_space: gym.spaces.Space
    observation_space: gym.spaces.Space

    def step(self, action):
        ...

    def reset(self):
        ...

    def render(self, mode="human"):
        ...
- The gym.Env interface has an action and an observation space.
- The Env can run a step and returns a tuple of observations, reward, done condition, and further info.
- An Env can reset itself, which will return the initial observations of a new episode.
- We can render an Env for different purposes, such as for human display or as a string representation.
You’ll recall from Chapter 3 that this is very similar to
the interface of the maze
Environment
we built there.
In fact, Gym has a so-called Discrete space implemented in gym.spaces, which means we can make our maze Environment a gym.Env as follows.
We assume that you store this code in a file called maze_gym_env.py and that the
code for the Environment
from Chapter 3 is located at the top of that
file (or is imported there):
# maze_gym_env.py | Original definition of Environment goes at the top.

import gym
from gym.spaces import Discrete


class GymEnvironment(Environment, gym.Env):
    def __init__(self, *args, **kwargs):
        """Make our original Environment a gym `Env`."""
        super().__init__(*args, **kwargs)


gym_env = GymEnvironment()
- Replace our own Discrete implementation with that of Gym.
- Make the GymEnvironment implement a gym.Env. The interface is essentially the same as before.
Of course, we could have made our original Environment
implement gym.Env
by
simply inheriting from it in the first place.
But the point is that the gym.Env
interface comes up so naturally in the context
of RL that it is a good exercise to implement it without having to resort to external
libraries.3
The gym.Env
interface also comes with helpful utility functionality and
many interesting example implementations.
For instance, the CartPole-v1
environment we used in Chapter 1 is an example from Gym,4 and there are many other environments available
to test your RL algorithms.
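As a quick sketch of what working with such an environment looks like (assuming the classic Gym API that Ray 2.2 relies on), you can instantiate and step through CartPole-v1 in just a few lines:

import gym

env = gym.make("CartPole-v1")
observation = env.reset()

# Take a single random action and observe the result.
action = env.action_space.sample()
observation, reward, done, info = env.step(action)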
Running the RLlib CLI
Now that we have our GymEnvironment implemented as a gym.Env, here's how you can use it with RLlib.
You’ve seen the RLlib CLI in action in Chapter 1,
but this time the situation is a bit different.
In the first chapter we simply ran a tuned example using the rllib example
command.
This time around we want to bring our own gym
environment class,
namely, the class GymEnvironment
that we defined in maze_gym_env.py.
To specify this class in Ray RLlib, you use the fully qualified name of the class from where you're referencing it, so in our case that's maze_gym_env.GymEnvironment.
If you had a more complicated Python project and your environment was stored in
another module, you’d simply add the module name accordingly.
The following Python file specifies the minimal configuration needed to train an RLlib
algorithm on the GymEnvironment
class.
To align as closely as possible with our experiment from Chapter 3, in which we
used Q-Learning, we use a DQNConfig
to define a DQN algorithm and store it in a
file called maze.py:
from ray.rllib.algorithms.dqn import DQNConfig

config = DQNConfig().environment("maze_gym_env.GymEnvironment")\
    .rollouts(num_rollout_workers=2)
This gives a quick preview of RLlib’s Python API, which we cover in the next section. To run this with RLlib, we’re using the rllib train
command.
We do this by specifying the file
we want to run: maze.py.
To make sure we can control the time of training, we tell our algorithm to stop
after running for a total of 10,000 time steps (timesteps_total
):
rllib train file maze.py --stop '{"timesteps_total": 10000}'
This single line takes care of everything we did in Chapter 3, but in a better way:
-
It runs a more sophisticated version of Q-Learning for us (DQN).5
-
It takes care of scaling out to multiple workers under the hood (in this case two).
-
It even creates checkpoints of the algorithm automatically for us.
From the output of that training script you should see that Ray will write training results to a directory located at ~/ray_results/maze_env. And if the training run finishes successfully,6 you'll get a checkpoint and a copy-pasteable rllib evaluate command in the output, just as in the example from Chapter 1.
Using this reported <checkpoint>
, you can now evaluate the trained policy on our
custom environment by running the following command:
rllib evaluate ~/ray_results/maze_env/<checkpoint> \
  --algo DQN \
  --env maze_gym_env.GymEnvironment \
  --steps 100
The algorithm used in --algo
and the environment specified with --env
have to
match the ones used in the training run, and we evaluate the trained algorithm for a
total of 100 steps.
This should lead to output of the following form:
Episode #1: reward: 1.0
Episode #2: reward: 1.0
Episode #3: reward: 1.0
...
Episode #13: reward: 1.0
It should not come as a big surprise that the DQN algorithm from RLlib gets the maximum reward of 1 for the simple maze environment we tasked it with every single time.
Before moving on to the Python API, we should mention that the RLlib CLI uses Ray Tune under the hood, for instance, to create the checkpoints of your algorithms. You will learn more about this integration in Chapter 5.
Using the RLlib Python API
In the end, the RLlib CLI is merely a wrapper around its underlying Python library. As you will likely spend most of your time coding your RL experiments in Python, we’ll focus the rest of this chapter on aspects of this API.
To run RL workloads with RLlib from Python, the Algorithm
class is your main entry point.
Always start with a corresponding AlgorithmConfig
class to define an algorithm.
For instance, in the previous section we used a DQNConfig
as a starting point, and
the rllib train
command took care of instantiating the DQN algorithm for us.
All other RLlib algorithms follow the same pattern.
Training RLlib algorithms
Every RLlib Algorithm
comes with reasonable default parameters, meaning that you can
initialize them without having to tweak any configuration parameters for these
algorithms.7
That said, it’s worth noting that RLlib algorithms are highly configurable,
as you will see in the following example.
We start by creating a DQNConfig
object.
Then we specify its environment
and set the number of rollout workers to two by using
the rollouts
method.
This means that the DQN algorithm will spawn two Ray actors, each using a
CPU by default, to run the algorithm in parallel.
Also, for later evaluation purposes, we set create_env_on_local_worker
to True
:
from ray.tune.logger import pretty_print
from maze_gym_env import GymEnvironment
from ray.rllib.algorithms.dqn import DQNConfig

config = (DQNConfig().environment(GymEnvironment)
          .rollouts(num_rollout_workers=2, create_env_on_local_worker=True))

pretty_print(config.to_dict())

algo = config.build()

for i in range(10):
    result = algo.train()

print(pretty_print(result))
- Set the environment to our custom GymEnvironment class and configure the number of rollout workers and ensure that an environment instance is created on the local worker.
- Use the DQNConfig from RLlib to build a DQN algorithm for training. This time we use two rollout workers.
- Call the train method to train the algorithm for 10 iterations.
- With the pretty_print utility, we can generate human-readable output of the training results.
Note that the number of training iterations has no special meaning, but it should be enough for the algorithm to learn to solve the maze problem adequately. The example just goes to show that you have full control over the training process.
From printing the config
dictionary, you can verify that the
num_rollout_workers
parameter is set to 2.8
The result
contains detailed information about the state of the DQN algorithm
and the training results, which are too verbose to show here.
The part that’s most relevant for us right now is information about the reward of the algorithm, which ideally indicates that the algorithm learned to solve the maze problem.
You should see output of the following form (we’re showing only the most relevant
information for clarity):
...
episode_reward_max: 1.0
episode_reward_mean: 1.0
episode_reward_min: 1.0
episodes_this_iter: 15
episodes_total: 19
...
training_iteration: 10
...
In particular, this output shows that the minimum reward attained per episode is 1.0, which in turn means that the agent always reached the goal and collected the maximum reward (1.0).
Saving, loading, and evaluating RLlib models
Reaching the goal for this simple example isn’t too difficult, but let’s see if evaluating the trained algorithm confirms that the agent can also do so in an optimal way, namely, by taking only the minimum number of eight steps to reach the goal.
To do so, we utilize another mechanism that you’ve already seen from the RLlib CLI: checkpointing.
Creating algorithm checkpoints is useful to ensure you can recover your work in case
of a crash or simply to track training progress persistently.
You can create a checkpoint of an RLlib algorithm at any point in the training
process by calling algo.save()
.
Once you have a checkpoint, you can easily restore your Algorithm
with it.
Evaluating a model is as simple as calling algo.evaluate() on the algorithm you trained or on one restored from a checkpoint.
Here’s how that looks if you put it all together:
from ray.rllib.algorithms.algorithm import Algorithm

checkpoint = algo.save()
print(checkpoint)

evaluation = algo.evaluate()
print(pretty_print(evaluation))

algo.stop()
restored_algo = Algorithm.from_checkpoint(checkpoint)
- Save algorithms to create checkpoints.
- Evaluate RLlib algorithms at any point in time by calling evaluate.
- Stop an algo to free all claimed resources.
- Restore any Algorithm from a given checkpoint with from_checkpoint.
Looking at the output of this example, we can now confirm that the trained RLlib algorithm did indeed converge to a good solution for the maze problem, as indicated by episodes of length 8 in evaluation:
~/ray_results/DQN_GymEnvironment_2022-02-09_10-19-301o3m9r6d/checkpoint_000010/
checkpoint-10
evaluation:
  ...
  episodes_this_iter: 5
  hist_stats:
    episode_lengths:
    - 8
    - 8
  ...
Computing actions
RLlib algorithms have much more functionality than just the train, evaluate, save, and from_checkpoint methods we've seen so far.
For example, you can directly compute actions given the current state of an environment.
In Chapter 3 we implemented episode rollouts by stepping through an environment
and collecting rewards.
We can easily do the same with RLlib for our GymEnvironment
:
env = GymEnvironment()
done = False
total_reward = 0
observations = env.reset()

while not done:
    action = algo.compute_single_action(observations)
    observations, reward, done, info = env.step(action)
    total_reward += reward
In case you should need to compute many actions at once, not just a single one,
you can use the compute_actions
method instead, which takes dictionaries of
observations as input and produces dictionaries of actions with the same dictionary
keys as output:
action = algo.compute_actions(
    {"obs_1": observations, "obs_2": observations}
)
print(action)
# {'obs_1': 0, 'obs_2': 1}
Accessing policy and model states
Remember that each reinforcement learning algorithm is based on a policy that chooses next actions given the agent’s current observations of the environment. Each policy is in turn based on an underlying model.
In the case of vanilla Q-Learning that we discussed in Chapter 3, the model was a simple lookup table of state-action values, also called Q-values. And that policy used this model for predicting next actions in case it decided to exploit what the model had learned so far or to explore the environment with random actions otherwise.
When using Deep Q-Learning, the underlying model of the policy is a neural network that, loosely speaking, maps observations to actions. Note that for choosing next actions in an environment, we’re ultimately not interested in the concrete values of the approximated Q-values, but rather in the probabilities of taking each action. The probability distribution over all possible actions is called an action distribution. In the maze we’re using as a running example, we can move up, right, down, or left. So, in our case an action distribution is a vector of four probabilities, one for each action. In the case of Q-Learning, the algorithm will always greedily choose the action with the highest probability of this distribution, while other algorithms will sample from it.
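To make the difference between greedy selection and sampling concrete, here's a small standalone sketch with made-up probabilities for the four maze actions:

import numpy as np

# Hypothetical action distribution over the four actions
# (down, left, up, right).
action_distribution = np.array([0.1, 0.2, 0.6, 0.1])

# Greedy selection, as in Q-Learning: always pick the most likely action.
greedy_action = np.argmax(action_distribution)  # 2, i.e. "up"

# Sampling, as other algorithms do: draw an action proportionally.
sampled_action = np.random.choice(4, p=action_distribution)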
To make things concrete, let’s look at how you access policies and models in RLlib:9
policy = algo.get_policy()
print(policy.get_weights())

model = policy.model
Both policy
and model
have many useful methods to explore.
In this example we use get_weights
to inspect the parameters of the model
underlying the policy (which are called weights by standard convention).
To convince you that not just one model is at play here
but in fact a collection of models,10 we can access all the workers we used in training and then ask each worker’s policy
for their weights using foreach_worker
:
workers = algo.workers
workers.foreach_worker(
    lambda remote_trainer: remote_trainer.get_policy().get_weights()
)
In this way, you can access every method available on an Algorithm
instance on each
of your workers.
In principle, you can use this to set model parameters as well, or otherwise
configure your workers.
RLlib workers are ultimately Ray actors, so you can alter and manipulate them in almost
any way you like.
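To illustrate this, here's a sketch of how you could broadcast the weights of the local policy to every rollout worker with the same foreach_worker mechanism (purely for illustration; RLlib already keeps worker weights in sync during training):

# Grab the weights of the policy on the local worker ...
weights = algo.get_policy().get_weights()

# ... and set them on the policy of every rollout worker.
workers.foreach_worker(
    lambda worker: worker.get_policy().set_weights(weights)
)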
We haven’t talked about the specific implementation of Deep Q-Learning used in DQN,
but the model used is a bit more complex than what we’ve described so far.
Every RLlib model obtained from a policy has a base_model
that has a neat summary
method
to describe itself:11
model.base_model.summary()
As you can see from the following output, this model takes in our observations. The shape of these observations is a bit strangely annotated as [(None, 25)], but essentially this means we have the expected 5 × 5 maze grid values correctly encoded. The model follows with two so-called Dense layers and predicts a single value at the end:12
Model: "model"
________________________________________________________________________________
Layer (type)               Output Shape     Param #    Connected to
================================================================================
observations (InputLayer)  [(None, 25)]     0
________________________________________________________________________________
fc_1 (Dense)               (None, 256)      6656       observations[0][0]
________________________________________________________________________________
fc_out (Dense)             (None, 256)      65792      fc_1[0][0]
________________________________________________________________________________
value_out (Dense)          (None, 1)        257        fc_1[0][0]
================================================================================
Total params: 72,705
Trainable params: 72,705
Non-trainable params: 0
________________________________________________________________________________
Note that it’s perfectly possible to customize this model for your RLlib experiments. If your environment is complex and has a big observation space, for instance, you might need a bigger model to capture that complexity. However, doing so requires in-depth knowledge of the underlying neural network framework (in this case TensorFlow), which we don’t assume you have.13
Next, let's see if we can take some observations from our environment and pass them to the model we just extracted from our policy. This part is a bit technically involved because models are a bit more difficult to access directly in RLlib. Normally you would only interface with a model through your policy, which takes care of preprocessing the observations, among other things. Luckily, we can access the preprocessor used by the policy, transform the observations from our environment, and then pass them to the model:
from ray.rllib.models.preprocessors import get_preprocessor

env = GymEnvironment()
obs_space = env.observation_space
preprocessor = get_preprocessor(obs_space)(obs_space)

observations = env.reset()
transformed = preprocessor.transform(observations).reshape(1, -1)

model_output, _ = model({"obs": transformed})
- Use get_preprocessor to access the preprocessor used by the policy.
- You can use transform on any observations obtained from your env to bring them into the format expected by the model. Note that we need to reshape the observations too.
- Get the model output by calling the model on a preprocessed observation dictionary.
Having computed our model_output
, we can now access the Q-values and the action distribution of the model for this output:
q_values = model.get_q_value_distributions(model_output)
print(q_values)

action_distribution = policy.dist_class(model_output, model)
sample = action_distribution.sample()
print(sample)
Configuring RLlib Experiments
Now that you’ve seen the basic Python training API of RLlib in an example,
let’s take a step back and discuss in more depth how to configure and run RLlib experiments.
By now you know that to define an Algorithm, you start with the respective AlgorithmConfig and then build your algorithm from it. So far we've used only the rollouts method of an AlgorithmConfig to set the number of rollout workers to two, and set our environment accordingly. If you want to alter the behavior of your RLlib training run, chain more utility methods onto the AlgorithmConfig instance and then call build on it at the end.
As RLlib algorithms are fairly complex, they come with many configuration options.
To make things easier, the common properties of algorithms are naturally grouped into
useful categories.14
Each such category comes with its own respective AlgorithmConfig
method:
training()
- Takes care of all training-related configuration options of your algorithm. The training method is the one place that RLlib algorithms differ in their configuration. All the following methods are algorithm-agnostic.

environment()
- Configures your training environment.

rollouts()
- Modifies the setup and behavior of your rollout workers.

exploration()
- Alters the exploration behavior of your algorithm.

resources()
- Specifies the compute resources used by your experiments.

offline_data()
- Defines options for training with so-called offline data, a topic we cover in "Working with Offline Data".

multi_agent()
- Specifies options for training algorithms using multiple agents. We discuss an explicit example of this in the next section.
The algorithm-specific configuration in training()
becomes even more relevant once
you’ve settled on an algorithm and want to squeeze it for performance.
In practice, RLlib provides you with good defaults to get started.
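To give you an idea of how these categories compose, here's a sketch that chains several of them on a single DQNConfig; the concrete values are arbitrary and only meant to show the pattern:

from ray.rllib.algorithms.dqn import DQNConfig

config = (
    DQNConfig()
    .environment("CartPole-v1")
    .rollouts(num_rollout_workers=2)
    .training(lr=0.001, train_batch_size=64)
    .exploration(explore=True)
    .resources(num_gpus=0)
)
algo = config.build()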
For more details on configuring RLlib experiments, look up configuration arguments in the API reference for RLlib algorithms. But before we move on to examples, you should learn about the most common configuration options in practice.
Resource Configuration
Whether you use Ray RLlib locally or on a cluster, you can specify the resources used for the training process. Here are the most important options to consider. We continue using the DQN algorithm as an example, but this would apply to any other RLlib algorithm as well:
from ray.rllib.algorithms.dqn import DQNConfig

config = DQNConfig().resources(
    num_gpus=1,
    num_cpus_per_worker=2,
    num_gpus_per_worker=0,
)
- Specify the number of GPUs to use for training. It's important to check whether your algorithm of choice supports GPUs first. This value can also be fractional. For example, if using four rollout workers in DQN (num_rollout_workers=4), you can set num_gpus=0.25 to pack all four workers on the same GPU so that all rollout workers benefit from the potential speedup. This affects only the local learner process, not the rollout workers.
- Set the number of CPUs to use for each rollout worker.
- Set the number of GPUs used per worker.
Rollout Worker Configuration
RLlib lets you configure how your rollouts are computed and how to distribute them:
from ray.rllib.algorithms.dqn import DQNConfig

config = DQNConfig().rollouts(
    num_rollout_workers=4,
    num_envs_per_worker=1,
    create_env_on_local_worker=True,
)
- You've seen this already. It specifies the number of Ray workers to use.
- Specify the number of environments to evaluate per worker. This setting allows you to "batch" evaluation of environments. In particular, if your models take a long time to evaluate, grouping environments like this can speed up training.
- When num_rollout_workers > 0, the driver ("local worker") does not need an environment. That's because sampling and evaluation is done by the rollout workers. If you still want an environment on the driver, you can set this option to True.
Environment Configuration
from ray.rllib.algorithms.dqn import DQNConfig

config = DQNConfig().environment(
    env="CartPole-v1",
    env_config={"my_config": "value"},
    observation_space=None,
    action_space=None,
    render_env=True,
)
- Specify the environment you want to use for training. This can be either a string of an environment known to Ray RLlib, such as any Gym environment, or the class name of a custom environment you've implemented.15
- Optionally specify a dictionary of configuration options for your environment that will be passed to the environment constructor.
- You can specify the observation and action spaces of your environment too. If you don't specify them, they will be inferred from the environment.
- False by default, this property allows you to turn on rendering of the environment, which requires you to implement the render method of your environment.
Note that we left out many available configuration options for each of the types we listed. On top of that, we can’t touch on aspects here that alter the behavior of the RL training procedure in this introduction (like modifying the underlying model to use). But the good news is that you’ll find all the information you need in the RLlib Training API documentation.
Working with RLlib Environments
So far we’ve introduced you to just Gym environments, but RLlib supports a wide variety of environments. After giving you a quick overview of all available options (see Figure 4-1), we’ll show you two concrete examples of advanced RLlib environments in action.
An Overview of RLlib Environments
All available RLlib environments extend a common BaseEnv
class.
If you want to work with several copies of the same gym.Env
environment,
you can use RLlib’s VectorEnv
wrapper.
Vectorized environments are useful, but they are straightforward generalizations
of what you’ve seen already.
The two other types of environments available in RLlib are more interesting
and deserve more attention.
The first is called MultiAgentEnv
, which allows you to train a model with multiple agents.
Working with multiple agents can be tricky.
That’s because you have to take care to define your agents within your environment
with a suitable interface and account for the fact that each agent might have a
completely different way of interacting with its environment.
What's more, agents might interact with each other and have to respect each other's actions. In more advanced settings, there might even be a hierarchy of agents that explicitly depend on each other. In short, running multi-agent RL experiments is difficult, and we'll see how RLlib handles this in the next example.
The other type of environment we will look at is called ExternalEnv, which can be used to connect external simulators to RLlib.
For instance, imagine our simple maze problem from earlier was a simulation of an
actual robot navigating a maze.
It might not be suitable in such scenarios to co-locate the robot (or its simulation,
implemented in a different software stack) with RLlib’s learning agents.
To account for that, RLlib provides you with a simple client-server architecture
for communicating with external simulators, which allows communication over a REST API.
In case you want to work both in a multi-agent and external environment setting,
RLlib offers a MultiAgentExternalEnv
environment that combines both.
Working with Multiple Agents
The basic idea of defining multi-agent environments in RLlib is simple. You first assign each agent an agent ID. Then, whatever you previously defined as a single value in a Gym environment (observations, rewards, etc.), you now define as a dictionary with agent IDs as keys and values per agent. Of course, the details are a little more complicated than that in practice. But once you have defined an environment hosting several agents, you have to define how these agents should learn.
In a single-agent environment there’s one agent and one policy to learn. In a multi-agent environment there are multiple agents that might map to one or several policies. For instance, if you have a group of homogenous agents in your environment, then you could define a single policy for all of them. If they all act the same way, then their behavior can be learned the same way. In contrast, you might have situations with heterogeneous agents in which each of them has to learn a separate policy. Between these two extremes, there’s a spectrum of possibilities, as shown in Figure 4-2.
We continue to use our maze game as a running example for this chapter.
This way you can check for yourself how the interfaces differ in practice.
So, to put the ideas we just outlined into code, let’s define a multi-agent version of the GymEnvironment
class.
Our MultiAgentEnv
class will have precisely two agents, which we encode in a Python
dictionary called agents
, but in principle this works with any number of agents.
We start by initializing and resetting our new environment:
from ray.rllib.env.multi_agent_env import MultiAgentEnv
from gym.spaces import Discrete
import os


class MultiAgentMaze(MultiAgentEnv):

    def __init__(self, *args, **kwargs):
        self.action_space = Discrete(4)
        self.observation_space = Discrete(5*5)
        self.agents = {1: (4, 0), 2: (0, 4)}
        self.goal = (4, 4)
        self.info = {1: {'obs': self.agents[1]}, 2: {'obs': self.agents[2]}}

    def reset(self):
        self.agents = {1: (4, 0), 2: (0, 4)}

        return {1: self.get_observation(1), 2: self.get_observation(2)}
- Action and observation spaces stay exactly the same as before.
- We now have two seekers with (0, 4) and (4, 0) starting positions in an agents dictionary.
- For the info object, we're using agent IDs as keys.
- Observations are now per-agent dictionaries.
Notice that we didn’t touch the action and observation spaces at all. That’s because we’re using two essentially identical agents here that can reuse the same spaces. In more complex situations you’d have to account for the fact that the actions and observations might look different for some agents.16
To continue, let's generalize our helper methods get_observation, get_reward, and is_done to work with multiple agents. We do this by passing an agent_id into their signatures and handling each agent the same way as before:
    def get_observation(self, agent_id):
        seeker = self.agents[agent_id]
        return 5 * seeker[0] + seeker[1]

    def get_reward(self, agent_id):
        return 1 if self.agents[agent_id] == self.goal else 0

    def is_done(self, agent_id):
        return self.agents[agent_id] == self.goal
Next, to port the step
method to our multi-agent setup, you have to know that
MultiAgentEnv
now expects the action
passed to a step
to be a dictionary
with keys corresponding to the agent IDs, too.
We define a step by looping through all available agents and acting on
their behalf:17
    def step(self, action):
        agent_ids = action.keys()

        for agent_id in agent_ids:
            seeker = self.agents[agent_id]
            if action[agent_id] == 0:  # move down
                seeker = (min(seeker[0] + 1, 4), seeker[1])
            elif action[agent_id] == 1:  # move left
                seeker = (seeker[0], max(seeker[1] - 1, 0))
            elif action[agent_id] == 2:  # move up
                seeker = (max(seeker[0] - 1, 0), seeker[1])
            elif action[agent_id] == 3:  # move right
                seeker = (seeker[0], min(seeker[1] + 1, 4))
            else:
                raise ValueError("Invalid action")
            self.agents[agent_id] = seeker

        observations = {i: self.get_observation(i) for i in agent_ids}
        rewards = {i: self.get_reward(i) for i in agent_ids}
        done = {i: self.is_done(i) for i in agent_ids}

        done["__all__"] = all(done.values())

        return observations, rewards, done, self.info
- Actions in a step are now per-agent dictionaries.
- After applying the correct action for each seeker, set the correct states of all agents.
- observations, rewards, and dones are also dictionaries with agent IDs as keys.
- Additionally, RLlib needs to know when all agents are done.
The last step is to modify rendering the environment, which we do by denoting each agent by its ID when printing the maze to the screen:
    def render(self, *args, **kwargs):
        os.system('cls' if os.name == 'nt' else 'clear')
        grid = [['| ' for _ in range(5)] + ["|\n"] for _ in range(5)]
        grid[self.goal[0]][self.goal[1]] = '|G'
        grid[self.agents[1][0]][self.agents[1][1]] = '|1'
        grid[self.agents[2][0]][self.agents[2][1]] = '|2'
        print(''.join([''.join(grid_row) for grid_row in grid]))
Randomly rolling out an episode until one of the agents reaches the goal can, for instance, be done by the following code:18
import time

env = MultiAgentMaze()

while True:
    obs, rew, done, info = env.step(
        {1: env.action_space.sample(), 2: env.action_space.sample()}
    )
    time.sleep(0.1)
    env.render()
    if any(done.values()):
        break
Note how we have to make sure to pass two random samples by means of a Python
dictionary into the step
method,
and how we check if any of the agents are done
yet.
We use this break
condition for simplicity because it’s highly unlikely that both
seekers find their way to the goal at the same time by chance.
But of course we’d like both agents to complete the maze eventually.
In any case, equipped with our MultiAgentMaze
, training an RLlib Algorithm
works exactly the same way as before:
from ray.rllib.algorithms.dqn import DQNConfig

simple_trainer = DQNConfig().environment(env=MultiAgentMaze).build()
simple_trainer.train()
This covers the simplest case of training a multi-agent reinforcement learning (MARL) problem.
But if you remember what we said earlier, when using multiple agents, there’s
always a mapping between agents and policies.
By not specifying such a mapping, both of our seekers were implicitly assigned to the same policy.
This can be changed by calling the .multi_agent
method on our DQNConfig
and setting
the policies
and policy_mapping_fn
arguments accordingly:
algo = DQNConfig()\
    .environment(env=MultiAgentMaze)\
    .multi_agent(
        policies={
            "policy_1": (
                None, env.observation_space, env.action_space, {"gamma": 0.80}
            ),
            "policy_2": (
                None, env.observation_space, env.action_space, {"gamma": 0.95}
            ),
        },
        policy_mapping_fn=lambda agent_id: f"policy_{agent_id}",
    ).build()

print(algo.train())
- Define multiple policies for our agents, each with a different "gamma" value.
- Each agent can then be mapped to a policy with a custom policy_mapping_fn.
As you can see, running multi-agent RL experiments is a first-class citizen of RLlib, and there’s a lot more that could be said about it. The support of MARL problems is probably one of RLlib’s strongest features.
Working with Policy Servers and Clients
For the last example in this section, let’s assume our original
GymEnvironment
can be simulated only on a machine that can’t run RLlib,
for instance because it doesn’t have enough resources available.
We can run the environment on a PolicyClient
that can ask a respective server
for suitable next actions to apply to the environment.
The server, in turn, does not know about the environment.
It only knows how to ingest input data from a PolicyClient, and it is responsible for running all RL-related code; in particular, it defines an RLlib AlgorithmConfig object and trains an Algorithm.
Typically, you want to run the server that trains your algorithm on a powerful Ray Cluster, and then the respective client runs outside that cluster. Figure 4-3 schematically illustrates this setup.
Defining a server
Let’s start by defining the server side of such an application first.
We define a so-called PolicyServerInput
that runs on localhost on port 9900.
This policy input is what the client will provide later.
With this policy_input
defined as input
to our algorithm configuration,
we can define yet another DQN to run on the server:
# policy_server.py
import ray
from ray.rllib.algorithms.dqn import DQNConfig
from ray.rllib.env.policy_server_input import PolicyServerInput
import gym

ray.init()


def policy_input(context):
    return PolicyServerInput(context, "localhost", 9900)


config = DQNConfig()\
    .environment(
        env=None,
        action_space=gym.spaces.Discrete(4),
        observation_space=gym.spaces.Discrete(5*5))\
    .debugging(log_level="INFO")\
    .rollouts(num_rollout_workers=0)\
    .offline_data(
        input=policy_input,
        input_evaluation=[])

algo = config.build()
- The policy_input function returns a PolicyServerInput object running on localhost on port 9900.
- We explicitly set the env to None because this server does not need one.
- We therefore need to define both an observation_space and an action_space, as the server is not able to infer them from the environment.
- To make this work, we need to feed our policy_input into the experiment's input.
With this algo
defined,19 we can now start a training session on the server like so:
# policy_server.py
if __name__ == "__main__":

    time_steps = 0
    for _ in range(100):
        results = algo.train()
        checkpoint = algo.save()
        if time_steps >= 1000:
            break
        time_steps += results["timesteps_total"]
- Train for a maximum of 100 iterations and store checkpoints after each iteration.
- If training surpasses 1,000 time steps, we stop the training.
In what follows we assume that you store the last two code snippets in a
file called policy_server.py.
If you want to, you can now start this policy server on your local machine by
running python policy_server.py
in a terminal.
Defining a client
Next, to define the corresponding client side of the application,
we define a PolicyClient
that connects to the server we just started.
Since we can't assume that you have several computers at home (or available in the cloud), contrary to the setup we just described, we will start this client on the same machine. In other words, the client will connect to http://localhost:9900, but if you can run the server on a different machine, you could replace localhost with the IP address of that machine, provided it's available in the network.
Policy clients have a fairly lean interface. They can trigger the server to start or end an episode, get next actions from it, and log reward information to it (that it would otherwise not have). With that said, here’s how you define such a client:
# policy_client.py
import gym
from ray.rllib.env.policy_client import PolicyClient

from maze_gym_env import GymEnvironment

if __name__ == "__main__":
    env = GymEnvironment()
    client = PolicyClient("http://localhost:9900", inference_mode="remote")

    obs = env.reset()
    episode_id = client.start_episode(training_enabled=True)

    while True:
        action = client.get_action(episode_id, obs)

        obs, reward, done, info = env.step(action)

        client.log_returns(episode_id, reward, info=info)

        if done:
            client.end_episode(episode_id, obs)
            exit(0)
- Start a policy client on the server address with remote inference mode.
- Tell the server to start an episode.
- For given environment observations, we can get the next action from the server.
- It's mandatory for the client to log reward information to the server.
- If a certain condition is reached, we can stop the client process.
- If the environment is done, we have to inform the server about episode completion.
Assuming you store this code under policy_client.py and start it by running
python policy_client.py
, then the server that we started earlier will start
learning with environment information solely obtained from the client.
Advanced Concepts
So far we’ve been working with simple environments that were easy enough to tackle with the most basic RL algorithm settings in RLlib. Of course, in practice you’re not always that lucky and might have to come up with other ideas to tackle more difficult environments. In this section we’re going to introduce a slightly harder version of the maze environment and discuss some advanced concepts to help you solve it.
Building an Advanced Environment
Let’s make our maze GymEnvironment
a bit more challenging.
First, we increase its size from a 5 × 5 to an 11 × 11 grid.
Then we introduce obstacles in the maze that the agent can pass through
but only by incurring a penalty, a negative reward of –1.
This way our seeker agent will have to learn to avoid obstacles
while still finding the goal.
Also, we randomize the agent’s starting position.
All of this makes the RL problem harder to solve.
Let’s look at the initialization of this new AdvancedEnv
first:
from gym.spaces import Discrete
import random
import os


class AdvancedEnv(GymEnvironment):

    def __init__(self, seeker=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.maze_len = 11
        self.action_space = Discrete(4)
        self.observation_space = Discrete(self.maze_len * self.maze_len)

        if seeker:
            assert 0 <= seeker[0] < self.maze_len and \
                   0 <= seeker[1] < self.maze_len
            self.seeker = seeker
        else:
            self.reset()

        self.goal = (self.maze_len - 1, self.maze_len - 1)
        self.info = {'seeker': self.seeker, 'goal': self.goal}

        self.punish_states = [
            (i, j) for i in range(self.maze_len) for j in range(self.maze_len)
            if i % 2 == 1 and j % 2 == 0
        ]
Next, when resetting the environment, we want to make sure to reset the agent’s position to a random state.20 We also increase the positive reward for reaching the goal to 5 to offset the negative reward for passing through an obstacle (which will happen a lot before the RL algorithm picks up on the obstacle locations). Balancing rewards like this is a crucial task in calibrating your RL experiments:
    def reset(self):
        """Reset seeker position randomly, return observations."""
        self.seeker = (
            random.randint(0, self.maze_len - 1),
            random.randint(0, self.maze_len - 1)
        )
        return self.get_observation()

    def get_observation(self):
        """Encode the seeker position as integer"""
        return self.maze_len * self.seeker[0] + self.seeker[1]

    def get_reward(self):
        """Reward finding the goal and punish forbidden states"""
        reward = -1 if self.seeker in self.punish_states else 0
        reward += 5 if self.seeker == self.goal else 0
        return reward

    def render(self, *args, **kwargs):
        """Render the environment, e.g. by printing its representation."""
        os.system('cls' if os.name == 'nt' else 'clear')
        grid = [['| ' for _ in range(self.maze_len)] +
                ["|\n"] for _ in range(self.maze_len)]
        for punish in self.punish_states:
            grid[punish[0]][punish[1]] = '|X'
        grid[self.goal[0]][self.goal[1]] = '|G'
        grid[self.seeker[0]][self.seeker[1]] = '|S'
        print(''.join([''.join(grid_row) for grid_row in grid]))
There are many other ways you could make this environment more difficult, like making it much bigger, introducing a negative reward for every step the agent takes in a certain direction, or punishing the agent for trying to walk off the grid. By now you should understand the problem setting well enough to customize the maze further.
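As one small sketch of such a variation, you could subclass AdvancedEnv and subtract a small amount from the reward for every step the agent takes (the class name and the penalty value are made up for illustration):

class StepPenaltyEnv(AdvancedEnv):
    """Hypothetical maze variant that punishes every step slightly."""

    def get_reward(self):
        # Keep the obstacle and goal rewards from AdvancedEnv and
        # add a small per-step penalty to encourage short paths.
        return super().get_reward() - 0.1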
While you might have success training this environment, this is a good opportunity to introduce some advanced concepts that you can apply to other RL problems.
Applying Curriculum Learning
One of the most interesting features of RLlib is providing an Algorithm
with a curriculum to learn from.
Instead of letting the algorithm learn from arbitrary environment setups,
we cherry-pick states that are much easier to learn from and then slowly but surely
introduce more difficult states.
Building a learning curriculum is a great way to make your experiments
converge to solutions quicker.
To apply curriculum learning, the only thing you need is a view on which
starting states are easier than others.
This can be a challenge for many environments, but it’s easy to come
up with a simple curriculum for our advanced maze.
Namely, the distance of the seeker from the goal can be used as a measure of difficulty.
The distance measure we'll use for simplicity is the sum of the absolute distances of both seeker coordinates from the goal, which defines the difficulty.
To run curriculum learning with RLlib, we define a CurriculumEnv that extends both our AdvancedEnv and a so-called TaskSettableEnv from RLlib.
The interface of TaskSettableEnv
is very simple in that you have to define only how
to get the current difficulty (get_task
) and how to set a required difficulty (set_task
).
Here’s the full definition of this CurriculumEnv
:
from ray.rllib.env.apis.task_settable_env import TaskSettableEnv


class CurriculumEnv(AdvancedEnv, TaskSettableEnv):

    def __init__(self, *args, **kwargs):
        AdvancedEnv.__init__(self)

    def difficulty(self):
        return abs(self.seeker[0] - self.goal[0]) + \
               abs(self.seeker[1] - self.goal[1])

    def get_task(self):
        return self.difficulty()

    def set_task(self, task_difficulty):
        while not self.difficulty() <= task_difficulty:
            self.reset()
- Define the difficulty of the current state as the sum of the absolute distance of both seeker coordinates from the goal.
- To define get_task we can then simply return the current difficulty.
- To set a task difficulty, we reset the environment until its difficulty is at most the specified task_difficulty.
To use this environment for curriculum learning, we need to define a curriculum function that tells the algorithm when and how to set the task difficulty. We have many options here, but we use a schedule that simply increases the difficulty by one every 1,000 time steps trained:
def curriculum_fn(train_results, task_settable_env, env_ctx):
    time_steps = train_results.get("timesteps_total")
    difficulty = time_steps // 1000
    print(f"Current difficulty: {difficulty}")
    return difficulty
To test this curriculum function, we need to add it to our RLlib algorithm config by setting the env_task_fn property to our curriculum_fn. Note that before training a DQN for 15 iterations, we also set an output folder in our config. This will store experience data of our training run to the specified temp folder:21
from ray.rllib.algorithms.dqn import DQNConfig
import tempfile

temp = tempfile.mkdtemp()

trainer = (
    DQNConfig()
    .environment(env=CurriculumEnv, env_task_fn=curriculum_fn)
    .offline_data(output=temp)
    .build()
)

for i in range(15):
    trainer.train()
- Create a temp folder to store our training data for later use.
- Set the CurriculumEnv as our environment in the environment part of our config and assign our curriculum_fn to the env_task_fn property.
- Use the offline_data method to store output in our temp folder.
Running this algorithm, you should see how the task difficulty increases over time, thereby giving the algorithm easy examples to start with so that it can learn from them and progress to more difficult tasks.
Curriculum learning is a great technique to be aware of and RLlib allows you to easily incorporate it into your experiments through the curriculum API we just discussed.
Working with Offline Data
In our previous curriculum learning example we stored training data to a temporary folder. What’s interesting is that you already know from Chapter 3 that in Q-Learning you can collect experience data first and decide when to use it in a training step later. This separation of data collection and training opens up many possibilities. For instance, maybe you have a good heuristic that can solve your problem in an imperfect yet reasonable manner. Or you have records of human interaction with your environment, demonstrating how to solve the problem by example.
The topic of collecting experience data for later training is often discussed as working with offline data. It’s called “offline” because it’s not directly generated by a policy interacting online with the environment. Algorithms that don’t rely on training on their own policy output are called off-policy algorithms, and Q-Learning, particularly DQN, is just one such example. Algorithms that don’t share this property are called on-policy algorithms. In other words, off-policy algorithms can be used to train on offline data.22
To use the data we stored in the temp folder, we can create a new DQNConfig that takes this folder as input. We will also set explore to False, since we simply want to exploit the data previously collected for training; the algorithm will not explore according to its own policy. Using the resulting RLlib algorithm works exactly as before, which we demonstrate by training it for 10 iterations and then evaluating it:
imitation_algo = (
    DQNConfig()
    .environment(env=AdvancedEnv)
    .evaluation(off_policy_estimation_methods={})
    .offline_data(input_=temp)
    .exploration(explore=False)
    .build())

for i in range(10):
    imitation_algo.train()

imitation_algo.evaluate()
Note that we called the algorithm imitation_algo. That's because this training procedure intends to imitate the behavior reflected in the data we collected before. This type of learning by demonstration in RL is therefore often called imitation learning or behavior cloning.
Other Advanced Topics
Before concluding this chapter, let’s have a look at a few other advanced topics that RLlib has to offer. You’ve already seen how flexible RLlib is: working with a range of different environments, configuring your experiments, training on a curriculum, or running imitation learning. This section gives you a taste of what else is possible.
With RLlib, you can completely customize the models and policies used under the hood. If you’ve worked with deep learning before, you know how important it can be to have a good model architecture in place. In RL this is often not as crucial as in supervised learning, but it is still a vital part of successfully running advanced experiments.
You can also change the way your observations are preprocessed by providing custom preprocessors. For our simple maze examples, there was nothing to preprocess, but when working with image or video data, preprocessing is often a crucial step.
In our AdvancedEnv
we introduced states to avoid. Our agents had to learn to do this, but RLlib has a feature to automatically avoid them through so-called parametric action spaces. Loosely speaking, what you can do is “mask out” all undesired actions from the action space for each point in time. In some cases it can also be necessary to have variable observation spaces, which is also fully supported by RLlib.
We briefly touched on the topic of offline data. RLlib has a full-fledged Python API for reading and writing experience data that can be used in various situations.
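As a small sketch of the reading side of that API (assuming temp still points to the output folder from the curriculum example above), you could load a batch of the stored experiences like this:

from ray.rllib.offline.json_reader import JsonReader

# Read back one batch of the experiences we wrote to the temp folder.
reader = JsonReader(temp)
sample_batch = reader.next()
print(sample_batch.count)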
We have worked solely with DQN here for simplicity, but RLlib has an impressive range of training algorithms. To name just one, the MARWIL algorithm is a complex hybrid algorithm with which you can run imitation learning from offline data, while also mixing in regular training on data generated “online.”
Summary
You’ve seen a selection of interesting RLlib features in this chapter. We covered training multi-agent environments, working with offline data generated by another agent, setting up a client-server architecture to split simulations from RL training, and using curriculum learning to specify increasingly difficult tasks.
We’ve also given you a quick overview of the main concepts underlying RLlib and how to use its CLI and Python API. In particular, we’ve shown how to configure your RLlib algorithms and environments to your needs. As we’ve covered only a small part of RLlib’s possibilities, we encourage you to read its documentation and explore its API.
In the next chapter you’ll learn how to tune the hyperparameters of your RLlib models and policies with Ray Tune.
1 We’re using a simple game to illustrate the process of RL. There is a multitude of interesting industry applications of RL that are not games.
2 We don’t cover this integration in this book, but you can learn more about deploying RLlib models in the “Serving RLlib Models” tutorial in the Ray documentation.
3 From Ray 2.3.0 onward, RLlib will be using the Gymnasium library as drop-in replacement for Gym. This will likely introduce some breaking changes, so it’s best to stick with Ray 2.2.0 to follow this chapter.
4 Gym comes with a variety of interesting environments that are worth exploring. For instance, you can find many of the Atari environments that were used in the famous “Playing Atari with Deep Reinforcement Learning” paper from DeepMind, or advanced physics simulations using the MuJoCo engine.
5 To be precise, RLlib uses a double and dueling DQN.
6 In the GitHub repo for this book we’ve also included an equivalent maze.yml file that you could use via rllib train file maze.yml
(no --type
needed).
7 Of course, configuring your models is a crucial part of RL experiments. We will discuss configuration of RLlib algorithms in more detail in the next section.
8 If you set num_rollout_workers
to 0, only the local worker on the head node will be created, and all sampling from the env
is done there. This is particularly useful for debugging, as no additional Ray actor processes are spawned.
9 The Policy
class in RLlib today will be replaced in a future release. The new Policy
class will likely be a drop-in replacement for the most part and exhibit some minor differences. The idea of the class remains the same, though: a policy is a class that encapsulates the logic of choosing actions given observations, and it gives you access to the underlying models used.
10 Technically speaking, only the local model is used for actual training. The two worker models are used for action computation and data collection (rollouts). After each training step, the local model sends its current weights to the workers for synchronization. Fully distributed training, as opposed to distributed sampling, will be available across all RLlib algorithms in future Ray versions.
11 This is true by default, since we're using TensorFlow and Keras under the hood. Should you opt to change the framework specification of your algorithm to work with PyTorch directly, do print(model), in which case model is a torch.nn.Module. Access to the underlying model will be unified across all frameworks in the future.
12 The “value” output of this network represents the Q-value of state-action pairs.
13 To learn more about customizing your RLlib models, check out the guide to custom models in the Ray documentation.
14 We list only the methods we introduce in this chapter. Apart from those we mention, you also find options for evaluation of your algorithms, reporting, debugging, checkpointing, adding callbacks, altering your deep learning framework, requesting resources, and accessing experimental features.
15 There’s also a way to register your environments so that you can refer to them by name, but this requires using Ray Tune. You will learn about this feature in Chapter 5.
16 You can find a good example that defines different observation and action spaces for multiple agents in the RLlib documentation.
17 Note how this can lead to issues like deciding which agent gets to act first. In our simple maze problem the order of actions is irrelevant, but in more complex scenarios this becomes a crucial part of modeling the RL problem correctly.
18 Deciding when an episode is done is a crucial part of multi-agent RL, and it depends entirely on the problem at hand and what you want to achieve.
19 For technical reasons, we have to specify observation and action spaces here, which might not be necessary in future releases of RLlib, as it leaks environment information. Also note that we need to set input_evaluation
to an empty list to make this server work.
20 In the definition of reset
, we allow the seeker to reset on top of the goal to keep the definition simpler. Allowing this trivial edge case does not affect learning.
21 Note that if you run the notebook for this chapter on the cloud, the training process could take a while to finish.
22 Note that RLlib has a wide range of on-policy algorithms like PPO as well.