Policy optimization

Policy optimization methods are an alternative to Q-learning and value function approximation. Instead of learning Q-values for state-action pairs, these methods directly learn a policy π that maps a state to an action, improving that policy by following the gradient of the expected reward. Framed as a search or optimization problem, policy methods learn a good policy from a stochastic distribution over possible actions. Our network architecture therefore changes slightly so that it outputs a policy directly.
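To make the idea of learning a policy directly more concrete, here is a minimal sketch in NumPy. It is not the network from this chapter: it assumes a linear model followed by a softmax in place of a deeper network, a single hypothetical state, three actions, and a toy reward table in which only action 1 pays off. The update used is the basic REINFORCE-style rule (step size times reward times the gradient of the log-probability of the sampled action), which is one common way to compute a policy gradient.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - np.max(logits)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def policy(weights, state):
    # A linear "network": one logit per action, turned into probabilities.
    return softmax(weights @ state)

def grad_log_policy(weights, state, action):
    # Gradient of log pi(action | state) with respect to the weights for a
    # linear-softmax policy: outer(one_hot(action) - probs, state).
    probs = policy(weights, state)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return np.outer(one_hot - probs, state)

def train(num_steps=2000, learning_rate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    state = np.array([1.0, 0.0])         # single hypothetical state (assumption)
    weights = np.zeros((3, 2))           # 3 actions, 2 state features
    rewards = np.array([0.0, 1.0, 0.0])  # toy rewards: only action 1 pays off
    for _ in range(num_steps):
        probs = policy(weights, state)
        # Sample an action from the current stochastic policy.
        action = rng.choice(len(probs), p=probs)
        reward = rewards[action]
        # REINFORCE-style step: raise the log-probability of rewarded actions.
        weights += learning_rate * reward * grad_log_policy(weights, state, action)
    return weights, state

weights, state = train()
final_probs = policy(weights, state)
print(final_probs)  # probability mass should concentrate on action 1
```

Note that no Q-value is ever estimated here. The model's output is itself a probability distribution over actions, and training shifts that distribution toward the actions that earned reward.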

Because every state has a distribution of possible actions, the optimization problem becomes easier. We no longer have ...
