19

Trust Regions – PPO, TRPO, ACKTR, and SAC

Next, we will take a look at approaches for improving the stability of the stochastic policy gradient method. Several attempts have been made to make policy updates more stable, and in this chapter, we will focus on three methods:

  • Proximal policy optimization (PPO)
  • Trust region policy optimization (TRPO)
  • Advantage actor-critic (A2C) using Kronecker-factored trust region (ACKTR)

In addition, we will compare those methods to a relatively new off-policy method called soft actor-critic (SAC), which builds on the deep deterministic policy gradient (DDPG) method described in Chapter 17, Continuous Action Space. To compare them against the A2C baseline, we will use several environments ...
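
To give a taste of the core idea before we dive into the details, the following is a minimal sketch (not the book's code) of PPO's clipped surrogate loss in PyTorch; the function name, tensor arguments, and the clipping epsilon of 0.2 are illustrative assumptions:

    import torch

    def ppo_clip_loss(new_logprob, old_logprob, advantages, clip_eps=0.2):
        # Probability ratio between the updated policy and the policy
        # that collected the data
        ratio = torch.exp(new_logprob - old_logprob)
        # Unclipped surrogate objective
        surr_unclipped = ratio * advantages
        # Clipped surrogate: the ratio is kept within [1 - eps, 1 + eps]
        surr_clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # PPO maximizes the pessimistic (minimum) surrogate;
        # negate it to obtain a loss for gradient descent
        return -torch.min(surr_unclipped, surr_clipped).mean()

Clipping the probability ratio keeps each update close to the policy that gathered the data, which is the "proximal" constraint that gives the method its stability.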
