12

The Actor-Critic Method

In Chapter 11, Policy Gradients—an Alternative, we began investigating a policy-based alternative to the familiar family of value-based methods. In particular, we focused on the REINFORCE method and its modification, which uses the discounted reward to obtain the gradient of the policy (the gradient gives us the direction in which to improve the policy). Both methods worked well on the small CartPole problem, but on the more complicated Pong environment, convergence was painfully slow.
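As a reminder of what "using the discounted reward to obtain the gradient" means in practice, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the code from the previous chapter: the function names `discounted_returns` and `reinforce_loss` and the default `gamma=0.99` are hypothetical choices made for this example. The loss is the usual surrogate objective, the negated sum of each step's discounted return multiplied by the log-probability of the action taken. In a real implementation, an autograd framework would differentiate this quantity with respect to the policy parameters.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Compute Q_t = r_t + gamma * Q_{t+1} for every step of one episode.

    `gamma` is an assumed discount factor chosen for illustration.
    """
    returns = []
    running = 0.0
    # Walk the episode backwards so each step can reuse the next step's return.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_loss(log_probs, returns):
    """Surrogate loss: -(sum over t of Q_t * log pi(a_t | s_t)).

    Minimizing this value moves the policy along the policy gradient.
    Here the log-probabilities are plain floats; with an autograd library
    they would be tensors that depend on the policy parameters.
    """
    return -sum(q * lp for q, lp in zip(returns, log_probs))


# A three-step episode with a reward of 1.0 per step (made-up data).
episode_rewards = [1.0, 1.0, 1.0]
q_values = discounted_returns(episode_rewards, gamma=0.5)

# Assume the policy gave probability 0.5 to each action that was taken.
episode_log_probs = [math.log(0.5)] * 3
loss = reinforce_loss(episode_log_probs, q_values)
print(q_values, loss)
```

Note that earlier steps receive larger returns than later ones, because more future reward still lies ahead of them.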

Next, we will discuss another extension to the vanilla policy gradient method, one that noticeably improves both its stability and its convergence speed. Although the modification is minor, the new method has its own ...

