How do we use this reward signal to update the Controller? Remember, this reward signal is not differentiable like a loss function in supervised learning; we may not backpropagate this through the Controller on its own. Instead, we employ a policy gradient method called REINFORCE to iteratively update the Controller parameters, . In REINFORCE, the gradient of the reward function, J, with respect to the parameters of the Controller, , is defined as follows:
You may recall seeing a similar expression in Chapter 6, Learning ...