Chapter 4. Reinforcement Learning with Ray RLlib

In Chapter 3 you built an RL environment, a simulation to play out some games, an RL algorithm, and the code to parallelize the training of the algorithm—all completely from scratch. It’s good to know how to do all that, but in practice the only thing you really want to do when training RL algorithms is the first part, namely, specifying your custom environment, the “game” you want to play.1 Most of your efforts will go into selecting the right algorithm, setting it up, finding the best parameters for the problem, and generally focusing on training a well-performing policy.

Ray RLlib is an industry-grade library for building RL algorithms at scale. You’ve already seen a first example of RLlib in Chapter 1, but in this chapter we’ll go into much more depth. The great thing about RLlib is that it’s a mature library for developers that comes with good abstractions to work with. As you will see, many of these abstractions you already know from the previous chapter.

We start out by giving you an overview of RLlib’s capabilities. Then we quickly revisit the maze game from Chapter 3 and show you how to tackle it both with the RLlib CLI and the RLlib Python API in a few lines of code. You’ll see how easy RLlib is to get started before learning about its key concepts, such as RLlib environments and algorithms.

We’ll also take a closer look at some advanced RL topics that are extremely useful in practice but are not often properly supported in other RL libraries. For instance, you will learn how to create a curriculum for your RL agents so that they can learn simple scenarios before moving on to more complex ones. You will also see how RLlib deals with having multiple agents in a single environment and how to leverage experience data that you’ve collected outside your current application to improve your agent’s performance.

An Overview of RLlib

Before we dive into any examples, let’s quickly discuss what RLlib is and what it can do. As part of the Ray ecosystem, RLlib inherits all the performance and scalability benefits of Ray. In particular, RLlib is distributed by default, so you can scale your RL training to as many nodes as you want.

Another benefit of being built on top of Ray is that RLlib integrates tightly with other Ray libraries. For instance, the hyperparameters of any RLlib algorithm can be tuned with Ray Tune, as we will see in Chapter 5. You can also seamlessly deploy your RLlib models with Ray Serve.2

What’s extremely useful is that RLlib works with both of the predominant deep learning frameworks at the time of this writing: PyTorch and TensorFlow. You can use either one of them as your backend and can easily switch between them, often by changing just one line of code. That’s a huge benefit, as companies are often locked into their underlying deep learning framework and can’t afford to switch to another system and rewrite their code.
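For example, under Ray 2.2.0 the backend is selected with the framework() method on an algorithm config. The following is a minimal sketch (using the DQNConfig class we'll work with throughout this chapter) of what switching from TensorFlow to PyTorch typically looks like:

from ray.rllib.algorithms.dqn import DQNConfig

# TensorFlow ("tf") is the default backend; switching to PyTorch
# is a single method call on the config object.
tf_config = DQNConfig().framework("tf")
torch_config = DQNConfig().framework("torch")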

RLlib also has a track record of solving real-world problems and is a mature library used by many companies to bring their RL workloads to production. The RLlib API appeals to many engineers, as it offers the right level of abstraction for many applications while still being flexible enough to be extended.

Apart from these more general benefits, RLlib has a lot of RL-specific features that we will cover in this chapter. In fact, RLlib is so feature rich that it would deserve a book on its own, which means we can touch on just some aspects of it here. For instance, RLlib has a rich library of advanced RL algorithms to choose from. In this chapter we will focus on a few select ones, but you can track the growing list of options on the RLlib algorithms page. RLlib also has many options for specifying RL environments and is very flexible in handling them during training; for an overview of RLlib environments see the documentation.

Getting Started with RLlib

To use RLlib, make sure you have installed it on your computer:

pip install "ray[rllib]==2.2.0"
Note

Check out the accompanying notebook for this chapter if you don’t feel like typing the code yourself.

Every RL problem starts with having an interesting environment to investigate. In Chapter 1 we looked at the classical cart–pole balancing problem. Recall that we didn’t implement this cart–pole environment; it came out of the box with RLlib.

In contrast, in Chapter 3 we implemented a simple maze game on our own. The problem with this implementation is that we can’t directly use it with RLlib, or any other RL library for that matter. The reason is that RL libraries rely on common standards, and our environments need to implement certain interfaces. The best known and most widely used library for RL environments is gym, an open source Python project from OpenAI.

Let’s have a look at what Gym is and how to convert our maze Environment from the previous chapter to a Gym environment compatible with RLlib.

Building a Gym Environment

If you look at the well-documented and easy-to-read gym.Env environment interface on GitHub, you’ll notice that an implementation of this interface has two mandatory class variables and three methods that subclasses need to implement. You don’t have to check the source code, but we do encourage you to have a look. You might just be surprised by how much you already know about these environments.

In short, the interface of a Gym environment looks like the following pseudocode:

import gym


class Env:

    action_space: gym.spaces.Space
    observation_space: gym.spaces.Space  1

    def step(self, action):  2
        ...

    def reset(self):  3
        ...

    def render(self, mode="human"):  4
        ...
1

The gym.Env interface has an action and an observation space.

2

The Env can run a step and returns a tuple of observations, reward, done condition, and further info.

3

An Env can reset itself, which will return the initial observations of a new episode.

4

We can render an Env for different purposes, such as for human display or as a string representation.

You’ll recall from Chapter 3 that this is very similar to the interface of the maze Environment we built there. In fact, Gym has a so-called Discrete space implemented in gym.spaces, which means we can make our maze Environment a gym.Env as follows. We assume that you store this code in a file called maze_gym_env.py and that the code for the Environment from Chapter 3 is located at the top of that file (or is imported there):

# maze_gym_env.py  | Original definition of Environment goes at the top.

import gym
from gym.spaces import Discrete  1


class GymEnvironment(Environment, gym.Env):  2
    def __init__(self, *args, **kwargs):
        """Make our original Environment a gym `Env`."""
        super().__init__(*args, **kwargs)


gym_env = GymEnvironment()
1

Replace our own Discrete implementation with that of Gym.

2

Make the GymEnvironment implement a gym.Env. The interface is essentially the same as before.

Of course, we could have made our original Environment implement gym.Env by simply inheriting from it in the first place. But the point is that the gym.Env interface comes up so naturally in the context of RL that it is a good exercise to implement it without having to resort to external libraries.3

The gym.Env interface also comes with helpful utility functionality and many interesting example implementations. For instance, the CartPole-v1 environment we used in Chapter 1 is an example from Gym,4 and there are many other environments available to test your RL algorithms.
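To get a feel for the interface before adapting our own maze, here is a short sketch that interacts with CartPole-v1 by sampling random actions. It uses the pre-Gymnasium gym API that Ray 2.2.0 relies on (reset returning observations only, step returning a four-tuple):

import gym

env = gym.make("CartPole-v1")
observations = env.reset()

done = False
total_reward = 0
while not done:
    action = env.action_space.sample()  # pick a random action
    observations, reward, done, info = env.step(action)
    total_reward += reward

print(f"Random actions collected a total reward of {total_reward}")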

Running the RLlib CLI

Now that we have our GymEnvironment implemented as a gym.Env, here’s how you can use it with RLlib. You’ve seen the RLlib CLI in action in Chapter 1, but this time the situation is a bit different. In the first chapter we simply ran a tuned example using the rllib example command.

This time around we want to bring our own gym environment class, namely, the class GymEnvironment that we defined in maze_gym_env.py. To specify this class in Ray RLlib, you use the fully qualified name of the class from where you’re referencing it, so in our case that’s maze_gym_env.GymEnvironment. If you had a more complicated Python project and your environment was stored in another module, you’d simply add the module name accordingly.

The following Python file specifies the minimal configuration needed to train an RLlib algorithm on the GymEnvironment class. To align as closely as possible with our experiment from Chapter 3, in which we used Q-Learning, we use a DQNConfig to define a DQN algorithm and store it in a file called maze.py:

from ray.rllib.algorithms.dqn import DQNConfig

config = DQNConfig().environment("maze_gym_env.GymEnvironment")\
    .rollouts(num_rollout_workers=2)

This gives a quick preview of RLlib’s Python API, which we cover in the next section. To run this with RLlib, we’re using the rllib train command. We do this by specifying the file we want to run: maze.py. To make sure we can control the time of training, we tell our algorithm to stop after running for a total of 10,000 time steps (timesteps_total):

rllib train file maze.py --stop '{"timesteps_total": 10000}'

This single line takes care of everything we did in Chapter 3, but in a better way:

  • It runs a more sophisticated version of Q-Learning for us (DQN).5

  • It takes care of scaling out to multiple workers under the hood (in this case two).

  • It even creates checkpoints of the algorithm automatically for us.

From the output of that training script you should see that Ray will write training results to a directory located at ~/ray_results/maze_env. And if the training run finishes successfully,6 you’ll get a checkpoint and a copiable rllib evaluate command in the output, just as in the example from Chapter 1. Using this reported <checkpoint>, you can now evaluate the trained policy on our custom environment by running the following command:

rllib evaluate ~/ray_results/maze_env/<checkpoint>\
  --algo DQN\
  --env maze_gym_env.GymEnvironment\
  --steps 100

The algorithm used in --algo and the environment specified with --env have to match the ones used in the training run, and we evaluate the trained algorithm for a total of 100 steps. This should lead to output of the following form:

Episode #1: reward: 1.0
Episode #2: reward: 1.0
Episode #3: reward: 1.0
...
Episode #13: reward: 1.0

It should not come as a big surprise that the DQN algorithm from RLlib gets the maximum reward of 1 for the simple maze environment we tasked it with every single time.

Before moving on to the Python API, we should mention that the RLlib CLI uses Ray Tune under the hood, for instance, to create the checkpoints of your algorithms. You will learn more about this integration in Chapter 5.

Using the RLlib Python API

In the end, the RLlib CLI is merely a wrapper around its underlying Python library. As you will likely spend most of your time coding your RL experiments in Python, we’ll focus the rest of this chapter on aspects of this API.

To run RL workloads with RLlib from Python, the Algorithm class is your main entry point. Always start with a corresponding AlgorithmConfig class to define an algorithm. For instance, in the previous section we used a DQNConfig as a starting point, and the rllib train command took care of instantiating the DQN algorithm for us. All other RLlib algorithms follow the same pattern.

Training RLlib algorithms

Every RLlib Algorithm comes with reasonable default parameters, meaning that you can initialize them without having to tweak any configuration parameters for these algorithms.7

That said, it’s worth noting that RLlib algorithms are highly configurable, as you will see in the following example. We start by creating a DQNConfig object. Then we specify its environment and set the number of rollout workers to two by using the rollouts method. This means that the DQN algorithm will spawn two Ray actors, each using a CPU by default, to run the algorithm in parallel. Also, for later evaluation purposes, we set create_env_on_local_worker to True:

from ray.tune.logger import pretty_print
from maze_gym_env import GymEnvironment
from ray.rllib.algorithms.dqn import DQNConfig

config = (DQNConfig().environment(GymEnvironment)  1
          .rollouts(num_rollout_workers=2, create_env_on_local_worker=True))

pretty_print(config.to_dict())

algo = config.build()  2

for i in range(10):
    result = algo.train()  3

print(pretty_print(result))  4
1

Set the environment to our custom GymEnvironment class, configure the number of rollout workers, and ensure that an environment instance is created on the local worker.

2

Use the DQNConfig from RLlib to build a DQN algorithm for training. This time we use two rollout workers.

3

Call the train method to train the algorithm for 10 iterations.

4

With the pretty_print utility, we can generate human-readable output of the training results.

Note that the number of training iterations has no special meaning, but it should be enough for the algorithm to learn to solve the maze problem adequately. The example just goes to show that you have full control over the training process.
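For instance, instead of training for a fixed number of iterations, you could keep calling train until the mean episode reward reaches a target value. Here's a sketch of that idea, relying on the episode_reward_mean metric from the result dictionary (you'll see an excerpt of this dictionary shortly):

# Keep training until the policy reliably reaches the goal.
result = algo.train()
while result["episode_reward_mean"] < 1.0:
    result = algo.train()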

From printing the config dictionary, you can verify that the num_rollout_workers parameter is set to 2.8 The result contains detailed information about the state of the DQN algorithm and the training results, which are too verbose to show here. The part that’s most relevant for us right now is information about the reward of the algorithm, which ideally indicates that the algorithm learned to solve the maze problem. You should see output of the following form (we’re showing only the most relevant information for clarity):

...
episode_reward_max: 1.0
episode_reward_mean: 1.0
episode_reward_min: 1.0
episodes_this_iter: 15
episodes_total: 19
...
training_iteration: 10
...

In particular, this output shows that the minimum reward attained per episode is 1.0, which in turn means that the agent always reached the goal and collected the maximum reward (1.0).

Saving, loading, and evaluating RLlib models

Reaching the goal for this simple example isn’t too difficult, but let’s see if evaluating the trained algorithm confirms that the agent can also do so in an optimal way, namely, by taking only the minimum number of eight steps to reach the goal.

To do so, we utilize another mechanism that you’ve already seen from the RLlib CLI: checkpointing. Creating algorithm checkpoints is useful to ensure you can recover your work in case of a crash or simply to track training progress persistently. You can create a checkpoint of an RLlib algorithm at any point in the training process by calling algo.save(). Once you have a checkpoint, you can easily restore your Algorithm with it. Evaluating a model is as simple as calling algo.evaluate(checkpoint) with the checkpoint you created. Here’s how that looks if you put it all together:

from ray.rllib.algorithms.algorithm import Algorithm


checkpoint = algo.save()  1
print(checkpoint)

evaluation = algo.evaluate()  2
print(pretty_print(evaluation))

algo.stop()  3
restored_algo = Algorithm.from_checkpoint(checkpoint)  4
1

Save algorithms to create checkpoints.

2

Evaluate RLlib algorithms at any point in time by calling evaluate.

3

Stop an algo to free all claimed resources.

4

Restore any Algorithm from a given checkpoint with from_​check⁠point.

Looking at the output of this example, we can now confirm that the trained RLlib algorithm did indeed converge to a good solution for the maze problem, as indicated by episodes of length 8 in evaluation:

~/ray_results/DQN_GymEnvironment_2022-02-09_10-19-301o3m9r6d/checkpoint_000010/
checkpoint-10 evaluation:
  ...
  episodes_this_iter: 5
  hist_stats:
    episode_lengths:
    - 8
    - 8
    ...

Computing actions

RLlib algorithms have much more functionality than just the train, evaluate, save, and from_checkpoint methods we’ve seen so far. For example, you can directly compute actions given the current state of an environment. In Chapter 3 we implemented episode rollouts by stepping through an environment and collecting rewards. We can easily do the same with RLlib for our GymEnvironment:

env = GymEnvironment()
done = False
total_reward = 0
observations = env.reset()

while not done:
    action = algo.compute_single_action(observations)  1
    observations, reward, done, info = env.step(action)
    total_reward += reward
1

To compute actions for given observations, use compute_single_action.

In case you should need to compute many actions at once, not just a single one, you can use the compute_actions method instead, which takes dictionaries of observations as input and produces dictionaries of actions with the same dictionary keys as output:

action = algo.compute_actions(  1
    {"obs_1": observations, "obs_2": observations}
)
print(action)
# {'obs_1': 0, 'obs_2': 1}
1

For multiple actions, use compute_actions.

Accessing policy and model states

Remember that each reinforcement learning algorithm is based on a policy that chooses next actions given the agent’s current observations of the environment. Each policy is in turn based on an underlying model.

In the case of vanilla Q-Learning that we discussed in Chapter 3, the model was a simple lookup table of state-action values, also called Q-values. The policy used this model to predict the next action whenever it decided to exploit what the model had learned so far, and it explored the environment with random actions otherwise.

When using Deep Q-Learning, the underlying model of the policy is a neural network that, loosely speaking, maps observations to actions. Note that for choosing next actions in an environment, we’re ultimately not interested in the concrete values of the approximated Q-values, but rather in the probabilities of taking each action. The probability distribution over all possible actions is called an action distribution. In the maze we’re using as a running example, we can move up, right, down, or left. So, in our case an action distribution is a vector of four probabilities, one for each action. In the case of Q-Learning, the algorithm will always greedily choose the action with the highest probability of this distribution, while other algorithms will sample from it.
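To make this distinction concrete, here is a small sketch in plain NumPy (not RLlib code) that contrasts greedy selection with sampling, using a made-up action distribution over the four maze actions:

import numpy as np

# Hypothetical action distribution over the four maze actions;
# the probabilities are invented purely for illustration.
action_probs = np.array([0.1, 0.2, 0.6, 0.1])

greedy_action = int(np.argmax(action_probs))  # always pick the most likely action
sampled_action = int(np.random.choice(4, p=action_probs))  # draw according to the distribution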

To make things concrete, let’s look at how you access policies and models in RLlib:9

policy = algo.get_policy()
print(policy.get_weights())

model = policy.model

Both policy and model have many useful methods to explore. In this example we use get_weights to inspect the parameters of the model underlying the policy (which are called weights by standard convention).

To convince you that not just one model is at play here but in fact a collection of models,10 we can access all the workers we used in training and then ask each worker’s policy for its weights using foreach_worker:

workers = algo.workers
workers.foreach_worker(
    lambda remote_trainer: remote_trainer.get_policy().get_weights()
)

In this way, you can access every method available on an Algorithm instance on each of your workers. In principle, you can use this to set model parameters as well, or otherwise configure your workers. RLlib workers are ultimately Ray actors, so you can alter and manipulate them in almost any way you like.
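For example, one thing you could do with this access is broadcast the local policy's weights to every rollout worker. The following is just a sketch to illustrate the mechanism; RLlib already performs this kind of weight synchronization for you during training:

# Push the local policy's weights to each rollout worker.
weights = algo.get_policy().get_weights()
workers.foreach_worker(
    lambda worker: worker.get_policy().set_weights(weights)
)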

We haven’t talked about the specific implementation of Deep Q-Learning used in DQN, but the model used is a bit more complex than what we’ve described so far. Every RLlib model obtained from a policy has a base_model that has a neat summary method to describe itself:11

model.base_model.summary()

As you can see from the following output, this model takes in our observations. The shape of these observations is a bit strangely annotated as [(None, 25)], but essentially this means we have the expected 5 × 5 maze grid values correctly encoded. The model follows with two so-called Dense layers and predicts a single value at the end:12

Model: "model"
________________________________________________________________________________
Layer (type)                  Output Shape       Param #     Connected to
================================================================================
observations (InputLayer)     [(None, 25)]       0
________________________________________________________________________________
fc_1 (Dense)                  (None, 256)        6656        observations[0][0]
________________________________________________________________________________
fc_out (Dense)                (None, 256)        65792       fc_1[0][0]
________________________________________________________________________________
value_out (Dense)             (None, 1)          257         fc_1[0][0]
================================================================================
Total params: 72,705
Trainable params: 72,705
Non-trainable params: 0
________________________________________________________________________________

Note that it’s perfectly possible to customize this model for your RLlib experiments. If your environment is complex and has a big observation space, for instance, you might need a bigger model to capture that complexity. However, doing so requires in-depth knowledge of the underlying neural network framework (in this case TensorFlow), which we don’t assume you have.13

Next, let’s see if we can take some observations from our environment and pass them to the model we just extracted from our policy. This part is slightly more involved because models are harder to access directly in RLlib. Normally you would only interface with a model through your policy, which takes care of preprocessing the observations, among other things. Luckily, we can access the preprocessor used by the policy, transform the observations from our environment, and then pass them to the model:

from ray.rllib.models.preprocessors import get_preprocessor


env = GymEnvironment()
obs_space = env.observation_space
preprocessor = get_preprocessor(obs_space)(obs_space)  1

observations = env.reset()
transformed = preprocessor.transform(observations).reshape(1, -1)  2

model_output, _ = model({"obs": transformed})  3
1

Use get_preprocessor to access the preprocessor used by the policy.

2

You can use transform to convert any observations obtained from your env into the format expected by the model. Note that we need to reshape the observations too.

3

Get the model output by calling the model on a preprocessed observation dictionary.

Having computed our model_output, we can now access the Q-values and the action distribution of the model for this output:

q_values = model.get_q_value_distributions(model_output)  1
print(q_values)

action_distribution = policy.dist_class(model_output, model)  2
sample = action_distribution.sample()  3
print(sample)
1

The get_q_value_distributions method is specific to DQN models only.

2

By accessing dist_class we get the policy’s action distribution class.

3

Action distributions can be sampled from.

Configuring RLlib Experiments

Now that you’ve seen the basic Python training API of RLlib in an example, let’s take a step back and discuss in more depth how to configure and run RLlib experiments. By now you know that to define an Algorithm, you start with the respective AlgorithmConfig and then build your algorithm from it. So far we’ve only used the rollouts method of an AlgorithmConfig to set the number of rollout workers to two, and the environment method to set our environment.

If you want to alter the behavior of your RLlib training run, chain more utility methods onto the AlgorithmConfig instance and then call build on it at the end. As RLlib algorithms are fairly complex, they come with many configuration options. To make things easier, the common properties of algorithms are naturally grouped into useful categories.14 Each such category comes with its own respective AlgorithmConfig method:

training()

Takes care of all training-related configuration options of your algorithm. The training method is the one place that RLlib algorithms differ in their configuration. All the following methods are algorithm-agnostic.

environment()

Configures all aspects of your environment.

rollouts()

Modifies the setup and behavior of your rollout workers.

exploration()

Alters the behavior of your exploration strategy.

resources()

Configures the compute resources used by your algorithm.

offline_data()

Defines options for training with so-called offline data, a topic we cover in “Working with Offline Data”.

multi_agent()

Specifies options for training algorithms using multiple agents. We discuss an explicit example of this in the next section.

The algorithm-specific configuration in training() becomes even more relevant once you’ve settled on an algorithm and want to squeeze it for performance. In practice, RLlib provides you with good defaults to get started.

For more details on configuring RLlib experiments, look up configuration arguments in the API reference for RLlib algorithms. But before we move on to examples, you should learn about the most common configuration options in practice.
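To see how these category methods combine in code, here is a sketch that chains several of them for our maze problem before building the algorithm. The specific values are illustrative rather than tuned:

from ray.rllib.algorithms.dqn import DQNConfig
from maze_gym_env import GymEnvironment

config = (
    DQNConfig()
    .environment(env=GymEnvironment)  # which environment to train on
    .rollouts(num_rollout_workers=2)  # how experience is collected
    .training(lr=0.001)  # algorithm-specific training options
    .resources(num_gpus=0)  # compute resources to use
)
algo = config.build()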

Resource Configuration

Whether you use Ray RLlib locally or on a cluster, you can specify the resources used for the training process. Here are the most important options to consider. We continue using the DQN algorithm as an example, but this would apply to any other RLlib algorithm as well:

from ray.rllib.algorithms.dqn import DQNConfig

config = DQNConfig().resources(
    num_gpus=1,  1
    num_cpus_per_worker=2, 2
    num_gpus_per_worker=0, 3
)
1

Specify the number of GPUs to use for training. It’s important to check whether your algorithm of choice supports GPUs first. This value can also be fractional; for example, setting num_gpus=0.25 lets the learner process claim only a quarter of a GPU, so that several processes or trials can share the same device. Note that this setting affects only the local learner process, not the rollout workers.

2

Set the number of CPUs to use for each rollout worker.

3

Set the number of GPUs used per worker.

Rollout Worker Configuration

RLlib lets you configure how your rollouts are computed and how to distribute them:

from ray.rllib.algorithms.dqn import DQNConfig

config = DQNConfig().rollouts(
    num_rollout_workers=4,  1
    num_envs_per_worker=1, 2
    create_env_on_local_worker=True, 3
)
1

You’ve seen this already. It specifies the number of Ray workers to use.

2

Specify the number of environments to evaluate per worker. This setting allows you to “batch” evaluation of environments. In particular, if your models take a long time to evaluate, grouping environments like this can speed up training.

3

When num_rollout_workers > 0, the driver (“local worker”) does not need an environment. That’s because sampling and evaluation are done by the rollout workers. If you still want an environment on the driver, you can set this option to True.

Environment Configuration

from ray.rllib.algorithms.dqn import DQNConfig

config = DQNConfig().environment(
    env="CartPole-v1",  1
    env_config={"my_config": "value"}, 2
    observation_space=None,
    action_space=None, 3
    render_env=True, 4
)
1

Specify the environment you want to use for training. This can be either a string of an environment known to Ray RLlib, such as any Gym environment, or the class name of a custom environment you’ve implemented.15

2

Optionally specify a dictionary of configuration options for your environment that will be passed to the environment constructor.

3

You can specify the observation and action spaces of your environment too. If you don’t specify them, they will be inferred from the environment.

4

False by default, this property allows you to turn on rendering of the environment, which requires you to implement the render method of your environment.

Note that we left out many available configuration options for each of the types we listed. On top of that, we can’t touch on aspects here that alter the behavior of the RL training procedure in this introduction (like modifying the underlying model to use). But the good news is that you’ll find all the information you need in the RLlib Training API documentation.

Working with RLlib Environments

So far we’ve introduced you to just Gym environments, but RLlib supports a wide variety of environments. After giving you a quick overview of all available options (see Figure 4-1), we’ll show you two concrete examples of advanced RLlib environments in action.

An Overview of RLlib Environments

All available RLlib environments extend a common BaseEnv class. If you want to work with several copies of the same gym.Env environment, you can use RLlib’s VectorEnv wrapper. Vectorized environments are useful, but they are straightforward generalizations of what you’ve seen already. The two other types of environments available in RLlib are more interesting and deserve more attention.

RLlib envs
Figure 4-1. An overview of all available RLlib environments

The first is called MultiAgentEnv, which allows you to train a model with multiple agents. Working with multiple agents can be tricky. That’s because you have to take care to define your agents within your environment with a suitable interface and account for the fact that each agent might have a completely different way of interacting with its environment.

What’s more is that agents might interact with each other, and they have to respect each other’s actions. In more advanced settings, there might even be a hierarchy of agents that explicitly depend on each other. In short, running multi-agent RL experiments is difficult, and we’ll see how RLlib handles this in the next example.

The other type of environment we will look at is called ExternalEnv, which can be used to connect external simulators to RLlib. For instance, imagine our simple maze problem from earlier was a simulation of an actual robot navigating a maze. It might not be suitable in such scenarios to co-locate the robot (or its simulation, implemented in a different software stack) with RLlib’s learning agents. To account for that, RLlib provides you with a simple client-server architecture for communicating with external simulators, which allows communication over a REST API. In case you want to work both in a multi-agent and external environment setting, RLlib offers an ExternalMultiAgentEnv environment that combines both.

Working with Multiple Agents

The basic idea of defining multi-agent environments in RLlib is simple. You first assign each agent an agent ID. Then, whatever you previously defined as a single value in a Gym environment (observations, rewards, etc.), you now define as a dictionary with agent IDs as keys and values per agent. Of course, the details are a little more complicated than that in practice. But once you have defined an environment hosting several agents, you have to define how these agents should learn.

In a single-agent environment there’s one agent and one policy to learn. In a multi-agent environment there are multiple agents that might map to one or several policies. For instance, if you have a group of homogeneous agents in your environment, then you could define a single policy for all of them. If they all act the same way, then their behavior can be learned the same way. In contrast, you might have situations with heterogeneous agents in which each of them has to learn a separate policy. Between these two extremes, there’s a spectrum of possibilities, as shown in Figure 4-2.

We continue to use our maze game as a running example for this chapter. This way you can check for yourself how the interfaces differ in practice. So, to put the ideas we just outlined into code, let’s define a multi-agent version of the GymEnvironment class. Our MultiAgentEnv class will have precisely two agents, which we encode in a Python dictionary called agents, but in principle this works with any number of agents.

Mapping envs
Figure 4-2. Mapping agents to policies in multi-agent reinforcement learning problems

We start by initializing and resetting our new environment:

from ray.rllib.env.multi_agent_env import MultiAgentEnv
from gym.spaces import Discrete
import os


class MultiAgentMaze(MultiAgentEnv):

    def __init__(self, *args, **kwargs):  1
        self.action_space = Discrete(4)
        self.observation_space = Discrete(5*5)
        self.agents = {1: (4, 0), 2: (0, 4)}  2
        self.goal = (4, 4)
        self.info = {1: {'obs': self.agents[1]}, 2: {'obs': self.agents[2]}}  3

    def reset(self):
        self.agents = {1: (4, 0), 2: (0, 4)}

        return {1: self.get_observation(1), 2: self.get_observation(2)}  4
1

Action and observation spaces stay exactly the same as before.

2

We now have two seekers with (0, 4) and (4, 0) starting positions in an agents dictionary.

3

For the info object, we’re using agent IDs as keys.

4

Observations are now per-agent dictionaries.

Notice that we didn’t touch the action and observation spaces at all. That’s because we’re using two essentially identical agents here that can reuse the same spaces. In more complex situations you’d have to account for the fact that the actions and observations might look different for some agents.16

To continue, let’s generalize our helper methods get_observation, get_reward, and is_done to work with multiple agents. We do this by passing an agent_id to their signatures and handling each agent the same way as before:

    def get_observation(self, agent_id):
        seeker = self.agents[agent_id]
        return 5 * seeker[0] + seeker[1]

    def get_reward(self, agent_id):
        return 1 if self.agents[agent_id] == self.goal else 0

    def is_done(self, agent_id):
        return self.agents[agent_id] == self.goal

Next, to port the step method to our multi-agent setup, you have to know that MultiAgentEnv now expects the action passed to a step to be a dictionary with keys corresponding to the agent IDs, too. We define a step by looping through all available agents and acting on their behalf:17

    def step(self, action):  1
        agent_ids = action.keys()

        for agent_id in agent_ids:
            seeker = self.agents[agent_id]
            if action[agent_id] == 0:  # move down
                seeker = (min(seeker[0] + 1, 4), seeker[1])
            elif action[agent_id] == 1:  # move left
                seeker = (seeker[0], max(seeker[1] - 1, 0))
            elif action[agent_id] == 2:  # move up
                seeker = (max(seeker[0] - 1, 0), seeker[1])
            elif action[agent_id] == 3:  # move right
                seeker = (seeker[0], min(seeker[1] + 1, 4))
            else:
                raise ValueError("Invalid action")
            self.agents[agent_id] = seeker  2

        observations = {i: self.get_observation(i) for i in agent_ids}  3
        rewards = {i: self.get_reward(i) for i in agent_ids}
        done = {i: self.is_done(i) for i in agent_ids}

        done["__all__"] = all(done.values())  4

        return observations, rewards, done, self.info
1

Actions in a step are now per-agent dictionaries.

2

After applying the correct action for each seeker, set the correct states of all agents.

3

observations, rewards, and dones are also dictionaries with agent IDs as keys.

4

Additionally, RLlib needs to know when all agents are done.

The last step is to modify the rendering of the environment, which we do by denoting each agent by its ID when printing the maze to the screen:

    def render(self, *args, **kwargs):
        os.system('cls' if os.name == 'nt' else 'clear')
        grid = [['| ' for _ in range(5)] + ["|\n"] for _ in range(5)]
        grid[self.goal[0]][self.goal[1]] = '|G'
        grid[self.agents[1][0]][self.agents[1][1]] = '|1'
        grid[self.agents[2][0]][self.agents[2][1]] = '|2'
        print(''.join([''.join(grid_row) for grid_row in grid]))

Randomly rolling out an episode until one of the agents reaches the goal can, for instance, be done by the following code:18

import time

env = MultiAgentMaze()

while True:
    obs, rew, done, info = env.step(
        {1: env.action_space.sample(), 2: env.action_space.sample()}
    )
    time.sleep(0.1)
    env.render()
    if any(done.values()):
        break

Note how we have to make sure to pass two random samples by means of a Python dictionary into the step method, and how we check if any of the agents are done yet. We use this break condition for simplicity because it’s highly unlikely that both seekers find their way to the goal at the same time by chance. But of course we’d like both agents to complete the maze eventually.
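If you want both agents to complete the maze, here is a sketch of a rollout loop that only steps agents that aren't done yet and relies on the __all__ flag we set in our step method:

env = MultiAgentMaze()
env.reset()
done = {"__all__": False}
active_agents = {1, 2}

while not done["__all__"]:
    # Only sample actions for agents that haven't reached the goal yet.
    actions = {i: env.action_space.sample() for i in active_agents}
    obs, rew, done, info = env.step(actions)
    active_agents = {i for i in active_agents if not done.get(i, False)}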

In any case, equipped with our MultiAgentMaze, training an RLlib Algorithm works exactly the same way as before:

from ray.rllib.algorithms.dqn import DQNConfig

simple_trainer = DQNConfig().environment(env=MultiAgentMaze).build()
simple_trainer.train()

This covers the simplest case of training a multi-agent reinforcement learning (MARL) problem. But if you remember what we said earlier, when using multiple agents, there’s always a mapping between agents and policies. Because we didn’t specify such a mapping, both of our seekers were implicitly assigned to the same policy. This can be changed by calling the .multi_agent method on our DQNConfig and setting the policies and policy_mapping_fn arguments accordingly:

algo = DQNConfig()\
    .environment(env=MultiAgentMaze)\
    .multi_agent(
        policies={  1
            "policy_1": (
                None, env.observation_space, env.action_space, {"gamma": 0.80}
            ),
            "policy_2": (
                None, env.observation_space, env.action_space, {"gamma": 0.95}
            ),
        },
        policy_mapping_fn=lambda agent_id: f"policy_{agent_id}",  2
    ).build()

print(algo.train())
1

Define multiple policies for our agents, each with a different "gamma" value.

2

Each agent can then be mapped to a policy with a custom policy_mapping_fn.

As you can see, running multi-agent RL experiments is a first-class citizen of RLlib, and there’s a lot more that could be said about it. The support of MARL problems is probably one of RLlib’s strongest features.

Working with Policy Servers and Clients

For the last example in this section, let’s assume our original GymEnvironment can be simulated only on a machine that can’t run RLlib, for instance because it doesn’t have enough resources available. We can run the environment on a PolicyClient that can ask a respective server for suitable next actions to apply to the environment. The server, in turn, does not know about the environment. It only knows how to ingest input data from a PolicyClient, and it is responsible for running all RL-related code; in particular, it defines an RLlib AlgorithmConfig object and trains an Algorithm.

Typically, you want to run the server that trains your algorithm on a powerful Ray Cluster, and then the respective client runs outside that cluster. Figure 4-3 schematically illustrates this setup.

RLlib external app
Figure 4-3. Working with policy servers and clients in RLlib

Defining a server

Let’s start by defining the server side of such an application. We define a so-called PolicyServerInput that runs on localhost on port 9900. This policy input is what the client will connect to and feed experience data into later. With this policy_input defined as input to our algorithm configuration, we can define yet another DQN to run on the server:

# policy_server.py
import ray
from ray.rllib.algorithms.dqn import DQNConfig
from ray.rllib.env.policy_server_input import PolicyServerInput
import gym


ray.init()


def policy_input(context):
    return PolicyServerInput(context, "localhost", 9900)  1


config = DQNConfig()\
    .environment(
        env=None,  2
        action_space=gym.spaces.Discrete(4),  3
        observation_space=gym.spaces.Discrete(5*5))\
    .debugging(log_level="INFO")\
    .rollouts(num_rollout_workers=0)\
    .offline_data(  4
        input=policy_input,
        input_evaluation=[])


algo = config.build()
1

The policy_input function returns a PolicyServerInput object running on localhost on port 9900.

2

We explicitly set the env to None because this server does not need one.

3

We therefore need to define both an observation_space and an action_space, as the server is not able to infer them from the environment.

4

To make this work, we need to feed our policy_input into the experiment’s input.

With this algo defined,19 we can now start a training session on the server like so:

# policy_server.py
if __name__ == "__main__":

    time_steps = 0
    for _ in range(100):
        results = algo.train()
        checkpoint = algo.save()  1
        if time_steps >= 1000:  2
            break
        time_steps = results["timesteps_total"]
1

Train for a maximum of 100 iterations and store checkpoints after each iteration.

2

If training surpasses 1,000 time steps, we stop the training.

In what follows we assume that you store the last two code snippets in a file called policy_server.py. If you want to, you can now start this policy server on your local machine by running python policy_server.py in a terminal.

Defining a client

Next, to define the corresponding client side of the application, we define a PolicyClient that connects to the server we just started. Since we can’t assume that you have several computers at home (or available in the cloud), we will, contrary to the setup just described, start this client on the same machine. In other words, the client will connect to http://localhost:9900, but if you can run the server on a different machine, you could replace localhost with the IP address of that machine, provided it’s reachable over the network.

Policy clients have a fairly lean interface. They can trigger the server to start or end an episode, get next actions from it, and log reward information to it (that it would otherwise not have). With that said, here’s how you define such a client:

# policy_client.py
import gym
from ray.rllib.env.policy_client import PolicyClient
from maze_gym_env import GymEnvironment

if __name__ == "__main__":
    env = GymEnvironment()
    client = PolicyClient("http://localhost:9900", inference_mode="remote")  1

    obs = env.reset()
    episode_id = client.start_episode(training_enabled=True)  2

    while True:
        action = client.get_action(episode_id, obs)  3

        obs, reward, done, info = env.step(action)

        client.log_returns(episode_id, reward, info=info)  4

        if done:
            client.end_episode(episode_id, obs)  5
            exit(0)  6
1

Start a policy client on the server address with remote inference mode.

2

Tell the server to start an episode.

3

For given environment observations, we can get the next action from the server.

4

It’s mandatory for the client to log reward information to the server.

5

If the environment is done, we have to inform the server about episode completion.

6

Once the episode has ended, we can stop the client process.

Assuming you store this code under policy_client.py and start it by running python policy_client.py, then the server that we started earlier will start learning with environment information solely obtained from the client.

Advanced Concepts

So far we’ve been working with simple environments that were easy enough to tackle with the most basic RL algorithm settings in RLlib. Of course, in practice you’re not always that lucky and might have to come up with other ideas to tackle more difficult environments. In this section we’re going to introduce a slightly harder version of the maze environment and discuss some advanced concepts to help you solve it.

Building an Advanced Environment

Let’s make our maze GymEnvironment a bit more challenging. First, we increase its size from a 5 × 5 to an 11 × 11 grid. Then we introduce obstacles in the maze that the agent can pass through but only by incurring a penalty, a negative reward of –1. This way our seeker agent will have to learn to avoid obstacles while still finding the goal. Also, we randomize the agent’s starting position. All of this makes the RL problem harder to solve. Let’s look at the initialization of this new AdvancedEnv first:

from gym.spaces import Discrete
import random
import os


class AdvancedEnv(GymEnvironment):

    def __init__(self, seeker=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.maze_len = 11
        self.action_space = Discrete(4)
        self.observation_space = Discrete(self.maze_len * self.maze_len)

        if seeker:  1
            assert 0 <= seeker[0] < self.maze_len and \
                   0 <= seeker[1] < self.maze_len
            self.seeker = seeker
        else:
            self.reset()

        self.goal = (self.maze_len-1, self.maze_len-1)
        self.info = {'seeker': self.seeker, 'goal': self.goal}

        self.punish_states = [  2
            (i, j) for i in range(self.maze_len) for j in range(self.maze_len)
            if i % 2 == 1 and j % 2 == 0
        ]
1

Set the seeker position upon initialization.

2

Introduce punish_states as obstacles for the agent.

Next, when resetting the environment, we want to make sure to reset the agent’s position to a random state.20 We also increase the positive reward for reaching the goal to 5 to offset the negative reward for passing through an obstacle (which will happen a lot before the RL algorithm picks up on the obstacle locations). Balancing rewards like this is a crucial task in calibrating your RL experiments:

    def reset(self):
        """Reset seeker position randomly, return observations."""
        self.seeker = (
            random.randint(0, self.maze_len - 1),
            random.randint(0, self.maze_len - 1)
        )
        return self.get_observation()

    def get_observation(self):
        """Encode the seeker position as integer"""
        return self.maze_len * self.seeker[0] + self.seeker[1]

    def get_reward(self):
        """Reward finding the goal and punish forbidden states"""
        reward = -1 if self.seeker in self.punish_states else 0
        reward += 5 if self.seeker == self.goal else 0
        return reward

    def render(self, *args, **kwargs):
        """Render the environment, e.g. by printing its representation."""
        os.system('cls' if os.name == 'nt' else 'clear')
        grid = [['| ' for _ in range(self.maze_len)] +
                ["|\n"] for _ in range(self.maze_len)]
        for punish in self.punish_states:
            grid[punish[0]][punish[1]] = '|X'
        grid[self.goal[0]][self.goal[1]] = '|G'
        grid[self.seeker[0]][self.seeker[1]] = '|S'
        print(''.join([''.join(grid_row) for grid_row in grid]))

There are many other ways you could make this environment more difficult, like making it much bigger, introducing a negative reward for every step the agent takes in a certain direction, or punishing the agent for trying to walk off the grid. By now you should understand the problem setting well enough to customize the maze further.
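As one sketch in that spirit, the following variant (a hypothetical StepPenaltyEnv, not part of the chapter's code) adds a small penalty for every step, nudging the agent toward shorter paths; the value of 0.1 is an arbitrary choice you would want to tune:

class StepPenaltyEnv(AdvancedEnv):
    """AdvancedEnv variant that additionally punishes every step taken."""

    def get_reward(self):
        # Keep the obstacle penalty and goal reward from AdvancedEnv,
        # but subtract a small amount for each step.
        return super().get_reward() - 0.1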

While you might have success training this environment, this is a good opportunity to introduce some advanced concepts that you can apply to other RL problems.

Applying Curriculum Learning

One of the most interesting features of RLlib is that you can provide an Algorithm with a curriculum to learn from. Instead of letting the algorithm learn from arbitrary environment setups, we cherry-pick states that are much easier to learn from and then slowly but surely introduce more difficult states. Building a learning curriculum is a great way to make your experiments converge to solutions quicker. The only thing you need to apply curriculum learning is a view on which starting states are easier than others. This can be a challenge for many environments, but it’s easy to come up with a simple curriculum for our advanced maze: the distance of the seeker from the goal is a natural measure of difficulty. For simplicity, we’ll use the sum of the absolute distances of both seeker coordinates from the goal (the Manhattan distance) to define difficulty.

To run curriculum learning with RLlib, we define a CurriculumEnv that extends both our AdvancedEnv and a so-called TaskSettableEnv from RLlib. The interface of TaskSettableEnv is very simple in that you have to define only how to get the current difficulty (get_task) and how to set a required difficulty (set_task). Here’s the full definition of this CurriculumEnv:

from ray.rllib.env.apis.task_settable_env import TaskSettableEnv


class CurriculumEnv(AdvancedEnv, TaskSettableEnv):

    def __init__(self, *args, **kwargs):
        AdvancedEnv.__init__(self)

    def difficulty(self):  1
        return abs(self.seeker[0] - self.goal[0]) + \
               abs(self.seeker[1] - self.goal[1])

    def get_task(self):  2
        return self.difficulty()

    def set_task(self, task_difficulty):  3
        while not self.difficulty() <= task_difficulty:
            self.reset()
1

Define the difficulty of the current state as the sum of the absolute distance of both seeker coordinates from the goal.

2

To define get_task we can then simply return the current difficulty.

3

To set a task difficulty, we reset the environment until its difficulty is at most the specified task_difficulty.

To use this environment for curriculum learning, we need to define a curriculum function that tells the algorithm when and how to set the task difficulty. We have many options here, but we use a schedule that simply increases the difficulty by one every 1,000 time steps trained:

def curriculum_fn(train_results, task_settable_env, env_ctx):
    time_steps = train_results.get("timesteps_total")
    difficulty = time_steps // 1000
    print(f"Current difficulty: {difficulty}")
    return difficulty

To test this curriculum function, we need to add it to our RLlib algorithm config by setting the env_task_fn property to our curriculum_fn. Note that before training a DQN for 15 iterations, we also set an output folder in our config. This will store experience data of our training run to the specified temp folder:21

from ray.rllib.algorithms.dqn import DQNConfig
import tempfile


temp = tempfile.mkdtemp()  1

trainer = (
    DQNConfig()
    .environment(env=CurriculumEnv, env_task_fn=curriculum_fn)  2
    .offline_data(output=temp)  3
    .build()
)

for i in range(15):
    trainer.train()
1

Create a temp file to store our training data for later use.

2

Set the CurriculumEnv as our environment in the environment part of our config and assign our curriculum_fn to the env_task_fn property.

3

Use the offline_data method to store output in our temp folder.

Running this algorithm, you should see how the task difficulty increases over time, thereby giving the algorithm easy examples to start with so that it can learn from them and progress to more difficult tasks.

Curriculum learning is a great technique to be aware of, and RLlib lets you easily incorporate it into your experiments through the curriculum API we just discussed.

Working with Offline Data

In our previous curriculum learning example we stored training data to a temporary folder. What’s interesting is that you already know from Chapter 3 that in Q-Learning you can collect experience data first and decide when to use it in a training step later. This separation of data collection and training opens up many possibilities. For instance, maybe you have a good heuristic that can solve your problem in an imperfect yet reasonable manner. Or you have records of human interaction with your environment, demonstrating how to solve the problem by example.

The topic of collecting experience data for later training is often discussed as working with offline data. It’s called “offline” because it’s not directly generated by a policy interacting online with the environment. Algorithms that don’t rely on training on their own policy output are called off-policy algorithms, and Q-Learning, particularly DQN, is just one such example. Algorithms that don’t share this property are called on-policy algorithms. In other words, off-policy algorithms can be used to train on offline data.22

To use the data we stored in the temp folder, we can create a new DQNConfig that takes this folder as input. We will also set explore to False, since we simply want to exploit the data previously collected for training—the algorithm will not explore according to its own policy.

Using the resulting RLlib algorithm works exactly as before, which we demonstrate by training it for 10 iterations and then evaluating it:

imitation_algo = (
    DQNConfig()
    .environment(env=AdvancedEnv)
    .evaluation(off_policy_estimation_methods={})
    .offline_data(input_=temp)
    .exploration(explore=False)
    .build())

for i in range(10):
    imitation_algo.train()

imitation_algo.evaluate()

Note that we called the algorithm imitation_algo. That’s because this training procedure intends to imitate the behavior reflected in the data we collected before. This type of learning by demonstration in RL is therefore often called imitation learning or behavior cloning.

Other Advanced Topics

Before concluding this chapter, let’s have a look at a few other advanced topics that RLlib has to offer. You’ve already seen how flexible RLlib is: working with a range of different environments, configuring your experiments, training on a curriculum, or running imitation learning. This section gives you a taste of what else is possible.

With RLlib, you can completely customize the models and policies used under the hood. If you’ve worked with deep learning before, you know how important it can be to have a good model architecture in place. In RL this is often not as crucial as in supervised learning, but it is still a vital part of successfully running advanced experiments.

You can also change the way your observations are preprocessed by providing custom preprocessors. For our simple maze examples, there was nothing to preprocess, but when working with image or video data, preprocessing is often a crucial step.

In our AdvancedEnv we introduced states to avoid. Our agents had to learn to avoid these states themselves, but RLlib has a feature to rule out such actions automatically through so-called parametric action spaces. Loosely speaking, what you can do is “mask out” all undesired actions from the action space for each point in time. In some cases it can also be necessary to have variable observation spaces, which is also fully supported by RLlib.

We briefly touched on the topic of offline data. RLlib has a full-fledged Python API for reading and writing experience data that can be used in various situations.
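As a small sketch of that API, you could read back the experience data we wrote to the temp folder in the curriculum example and inspect one batch of it (assuming that folder still exists):

from ray.rllib.offline.json_reader import JsonReader

reader = JsonReader(temp)  # the temp folder from the curriculum example
batch = reader.next()  # one SampleBatch of recorded experiences
print(batch.keys())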

We have worked solely with DQN here for simplicity, but RLlib has an impressive range of training algorithms. To name just one, the MARWIL algorithm is a complex hybrid algorithm with which you can run imitation learning from offline data, while also mixing in regular training on data generated “online.”

Summary

You’ve seen a selection of interesting RLlib features in this chapter. We covered training multi-agent environments, working with offline data generated by another agent, setting up a client-server architecture to split simulations from RL training, and using curriculum learning to specify increasingly difficult tasks.

We’ve also given you a quick overview of the main concepts underlying RLlib and how to use its CLI and Python API. In particular, we’ve shown how to configure your RLlib algorithms and environments to your needs. As we’ve covered only a small part of RLlib’s possibilities, we encourage you to read its documentation and explore its API.

In the next chapter you’ll learn how to tune the hyperparameters of your RLlib models and policies with Ray Tune.

1 We’re using a simple game to illustrate the process of RL. There is a multitude of interesting industry applications of RL that are not games.

2 We don’t cover this integration in this book, but you can learn more about deploying RLlib models in the “Serving RLlib Models” tutorial in the Ray documentation.

3 From Ray 2.3.0 onward, RLlib will be using the Gymnasium library as drop-in replacement for Gym. This will likely introduce some breaking changes, so it’s best to stick with Ray 2.2.0 to follow this chapter.

4 Gym comes with a variety of interesting environments that are worth exploring. For instance, you can find many of the Atari environments that were used in the famous “Playing Atari with Deep Reinforcement Learning” paper from DeepMind, or advanced physics simulations using the MuJoCo engine.

5 To be precise, RLlib uses a double and dueling DQN.

6 In the GitHub repo for this book we’ve also included an equivalent maze.yml file that you could use via rllib train file maze.yml (no --type needed).

7 Of course, configuring your models is a crucial part of RL experiments. We will discuss configuration of RLlib algorithms in more detail in the next section.

8 If you set num_rollout_workers to 0, only the local worker on the head node will be created, and all sampling from the env is done there. This is particularly useful for debugging, as no additional Ray actor processes are spawned.

9 The Policy class in RLlib today will be replaced in a future release. The new Policy class will likely be a drop-in replacement for the most part and exhibit some minor differences. The idea of the class remains the same, though: a policy is a class that encapsulates the logic of choosing actions given observations, and it gives you access to the underlying models used.

10 Technically speaking, only the local model is used for actual training. The two worker models are used for action computation and data collection (rollouts). After each training step, the local model sends its current weights to the workers for synchronization. Fully distributed training, as opposed to distributed sampling, will be available across all RLlib algorithms in future Ray versions.

11 This is true by default, since we’re using TensorFlow and Keras under the hood. Should you opt to change the framework specification of your algorithm to work with PyTorch instead, you can call print(model) directly, in which case model is a torch.nn.Module. Access to the underlying model will be unified across all frameworks in the future.

12 The “value” output of this network represents the Q-value of state-action pairs.

13 To learn more about customizing your RLlib models, check out the guide to custom models in the Ray documentation.

14 We list only the methods we introduce in this chapter. Apart from those we mention, you also find options for evaluation of your algorithms, reporting, debugging, checkpointing, adding callbacks, altering your deep learning framework, requesting resources, and accessing experimental features.

15 There’s also a way to register your environments so that you can refer to them by name, but this requires using Ray Tune. You will learn about this feature in Chapter 5.

16 You can find a good example that defines different observation and action spaces for multiple agents in the RLlib documentation.

17 Note how this can lead to issues like deciding which agent gets to act first. In our simple maze problem the order of actions is irrelevant, but in more complex scenarios this becomes a crucial part of modeling the RL problem correctly.

18 Deciding when an episode is done is a crucial part of multi-agent RL, and it depends entirely on the problem at hand and what you want to achieve.

19 For technical reasons, we have to specify observation and action spaces here, which might not be necessary in future releases of RLlib, as it leaks environment information. Also note that we need to set input_evaluation to an empty list to make this server work.

20 In the definition of reset, we allow the seeker to reset on top of the goal to keep the definition simpler. Allowing this trivial edge case does not affect learning.

21 Note that if you run the notebook for this chapter on the cloud, the training process could take a while to finish.

22 Note that RLlib has a wide range of on-policy algorithms like PPO as well.
