Chapter 15. Trust Regions – TRPO, PPO, and ACKTR

In this chapter, we'll take a look at the approaches used to improve the stability of the stochastic policy gradient method. Several attempts have been made to make policy improvement more stable, and we'll focus on three methods: Proximal Policy Optimization (PPO), Trust Region Policy Optimization (TRPO), and Actor-Critic (A2C) using Kronecker-Factored Trust Region (ACKTR).

To compare them with the A2C baseline, we'll use several environments from the Roboschool library created by OpenAI.
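As a quick orientation, the sketch below shows how a Roboschool environment can be created through the Gym interface; importing the roboschool package registers its environments with Gym so they can be constructed by name. This is a minimal example, assuming the gym and roboschool packages are installed and using RoboschoolHalfCheetah-v1 as just one example of the registered environment names.

import gym
import roboschool  # noqa: F401 -- importing registers the Roboschool environments in Gym

env = gym.make("RoboschoolHalfCheetah-v1")
obs = env.reset()
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)

# Continuous control: the action is a real-valued vector, sampled randomly here for illustration
action = env.action_space.sample()
obs, reward, done, info = env.step(action)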

Introduction

The overall motivation of the methods that we'll take a look at is to improve the stability of the policy update during training. Intuitively, there is a dilemma: on the one hand, we'd like to train ...
