Policy gradient methods

All the RL algorithms we have discussed so far have tried to learn the state- or action-value functions. In Q-learning, for example, we usually follow an ε-greedy policy, which has no parameters of its own (well, except for ε itself) and relies entirely on the value function. In this section, we'll discuss something new: how to approximate the policy itself with the help of policy gradient methods. We'll follow a similar approach to the one in the Value function approximation section of Chapter 8, Reinforcement Learning Theory.

There, we introduced a value approximation function, described by a set of parameters w (the neural network weights). Here, we'll introduce a parameterized policy π(a|s, θ), described by a set of parameters θ. As with ...
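To make the idea concrete, here is a minimal sketch of such a parameterized policy for a discrete action space. The PyTorch framework, the layer sizes, and the state/action dimensions are illustrative assumptions, not taken from the book; the point is only that θ is the set of network weights, and the network outputs a probability distribution over actions:

    # A minimal sketch of a parameterized policy pi(a | s, theta).
    # Dimensions and layer sizes are illustrative assumptions.
    import torch
    import torch.nn as nn

    class PolicyNetwork(nn.Module):
        def __init__(self, state_dim, n_actions, hidden=32):
            super().__init__()
            # theta is the set of weights of this small feed-forward net
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, state):
            # Softmax turns the raw scores into a probability
            # distribution over actions: pi(a | s, theta)
            return torch.softmax(self.net(state), dim=-1)

    policy = PolicyNetwork(state_dim=4, n_actions=2)
    state = torch.rand(4)                        # a dummy observation
    probs = policy(state)                        # action probabilities
    action = torch.multinomial(probs, 1).item()  # sample an action from pi

Unlike the ε-greedy policy, this policy is stochastic by construction, and because it is differentiable with respect to θ, we can improve it directly with gradient ascent, which is what policy gradient methods do.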
