Policy Gradient Methods

So far, our reinforcement learning (RL) methods have focused on estimating the value of each action in a given state and then choosing the action with the maximum value. While this has worked well in previous chapters, it is not without problems. One of them is deciding when to actually take the max, or best, action, which is the exploration/exploitation trade-off. As we have seen, the action with the highest estimated value is not always the best choice, and it can be better to average over the top candidates. However, averaging is mathematically dangerous and tells us nothing about what the agent actually sampled in the environment. Ideally, we want a method that can learn the distribution of actions for each state in the environment. ...
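To make the idea of a distribution of actions for a state concrete, here is a minimal sketch that contrasts always taking the max with sampling from a distribution. It is an illustration under stated assumptions, not the chapter's implementation: the `softmax` helper, the `preferences` values, and the use of NumPy are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(preferences):
    """Turn raw action preferences into a probability distribution."""
    shifted = preferences - np.max(preferences)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical preferences for 3 actions in a single state (illustrative values only).
preferences = np.array([2.0, 1.0, 0.1])

# Value-based style: always commit to the single highest-scoring action.
greedy_action = int(np.argmax(preferences))

# Policy-based style: keep a full distribution over actions and sample from it.
action_probs = softmax(preferences)
rng = np.random.default_rng(seed=0)
sampled_actions = rng.choice(len(action_probs), size=1000, p=action_probs)

# Empirical frequency of each action among the samples.
sample_freqs = np.bincount(sampled_actions, minlength=len(action_probs)) / 1000

print("greedy action:", greedy_action)
print("action probabilities:", action_probs)
print("sampled frequencies:", sample_freqs)
```

The greedy choice never tries the lower-scoring actions, while the sampled policy still visits them in proportion to their probability, so exploration comes from the distribution itself.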

