Policy gradient methods

All the RL algorithms we have discussed so far have tried to learn the state- or action-value functions. For example, in Q-learning we usually follow an ε-greedy policy, which has no parameters (OK, it has one parameter) and relies on the value function instead. In this section, we'll discuss something new: how to approximate the policy itself with the help of policy gradient methods. We'll follow a similar approach to the one in Chapter 8, Reinforcement Learning Theory, in the Value function approximation section.
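To make the contrast concrete, here is a minimal sketch (not from the book) of an ε-greedy action selection rule over a table of Q-values; its single parameter is ε, the probability of taking a random exploratory action:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Return a random action with probability epsilon,
    otherwise the action with the highest Q-value."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))
```

Everything the policy knows comes from `q_values`; the policy itself carries no learnable weights, which is exactly what policy gradient methods change.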

There, we introduced a value approximation function, which is described by a set of parameters w (neural net weights). Here, we'll introduce a parameterized policy π(a|s, θ), which is described by a set of parameters θ. As with ...
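As a simple illustration of such a parameterized policy (a sketch of my own, not the book's implementation), the common linear-softmax form maps state features through a parameter matrix θ to a probability distribution over actions:

```python
import numpy as np

def softmax_policy(theta, state):
    """Compute action probabilities pi(a|s, theta) for a linear-softmax policy.

    theta: (n_features, n_actions) parameter matrix -- the policy's parameters.
    state: (n_features,) feature vector describing the current state.
    """
    logits = state @ theta
    logits = logits - logits.max()  # subtract max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()
```

Unlike the ε-greedy rule, this policy is differentiable with respect to θ, which is what lets policy gradient methods adjust it directly via gradient ascent on the expected return.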
