September 2018
Intermediate to advanced
296 pages
9h 10m
English
How do we use this reward signal to update the Controller? Remember, this reward signal is not differentiable like a loss function in supervised learning; we may not backpropagate this through the Controller on its own. Instead, we employ a policy gradient method called REINFORCE to iteratively update the Controller parameters,
. In REINFORCE, the gradient of the reward function, J, with respect to the parameters of the Controller,
, is defined as follows:
You may recall seeing a similar expression in Chapter 6, Learning ...
Read now
Unlock full access