Policy gradients for learning policy functions

The problem that policy gradients aims to solve is a more general version of the reinforcement learning problem: how can we use backpropagation on a task that has no gradient from the reward back to our parameters? To give a more concrete example, we have a neural network that produces the probability of taking an action a, given a state s and some parameters θ, which are the weights of our neural network:

$\pi(a \mid s, \theta)$
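
To make this concrete, here is a minimal sketch (not the book's code) of such a policy: a single linear layer followed by a softmax, so that the parameters θ map a state s to one probability per action a. The state size, number of actions, and the linear parameterization are illustrative assumptions.

import numpy as np

# Illustrative sizes; any state/action dimensions would do.
rng = np.random.default_rng(0)
state_dim, n_actions = 4, 2

# theta: the parameters (here a single weight matrix) of the policy network.
theta = rng.normal(scale=0.1, size=(state_dim, n_actions))

def policy(state, theta):
    """Return pi(a | s, theta): one probability per action."""
    logits = state @ theta
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

state = rng.normal(size=state_dim)        # an example state s
probs = policy(state, theta)              # action probabilities
action = rng.choice(n_actions, p=probs)   # sample an action a from the policy

Sampling the action from these probabilities is how the policy acts; the difficulty, described next, is that the reward those actions produce carries no gradient back to θ.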

We also have our reward signal R. The actions we take affect the reward we receive, but there is no gradient between the reward and the parameters. There is no equation in which we can plug ...
