Policy Gradient Methods
Until now, our reinforcement learning (RL) methods have focused on finding the maximum or best value for choosing a particular action in any given state. While this has worked well in previous chapters, it is not without problems. One of them is deciding when to actually take the max or best action, which is the exploration/exploitation trade-off. As we have seen, the action with the highest estimated value is not always the best choice, and it can be better to take the average of the best. However, averaging is mathematically dangerous and tells us nothing about what the agent actually sampled in the environment. Ideally, we want a method that can learn the distribution of actions for each state in the environment. ...
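To make the contrast concrete, here is a minimal sketch, not taken from the chapter, of the difference between always taking the best-valued action and sampling from a distribution over actions for a single state. The names `greedy_action`, `softmax_policy`, and `sample_action`, the example numbers, and the use of a softmax are all assumptions chosen for illustration. A softmax is one common way to turn scores into probabilities, not necessarily the one the chapter goes on to use.

```python
import math
import random


def greedy_action(action_values):
    """Value-based choice: always return the index of the largest estimate."""
    return max(range(len(action_values)), key=lambda a: action_values[a])


def softmax_policy(preferences):
    """Convert per-action scores into a probability distribution over actions."""
    highest = max(preferences)  # subtract the max for numerical stability
    exps = [math.exp(p - highest) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probabilities, rng):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


# Hypothetical scores for three actions in one state (illustrative values only).
action_values = [1.0, 1.1, 0.2]

# The greedy rule commits to a single action every time.
best = greedy_action(action_values)

# A stochastic policy keeps probability mass on every action.
probs = softmax_policy(action_values)

# Sampling repeatedly shows how often each action would actually be tried.
rng = random.Random(0)
counts = [0] * len(action_values)
for _ in range(1000):
    counts[sample_action(probs, rng)] += 1

print("greedy action:", best)
print("policy probabilities:", [round(p, 3) for p in probs])
print("sampled action counts:", counts)
```

In this sketch the two nearly tied actions both receive substantial probability, so the agent keeps trying each of them, whereas the greedy rule would ignore the runner-up entirely.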