April 2018
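Assuming our given policy $\pi_\theta(s, a)$ is differentiable wherever it is non-zero, the gradient of the policy with respect to $\theta$ is $\nabla_\theta \pi_\theta(s, a)$. We can rewrite this gradient in the form of a likelihood ratio as follows:

$$\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a) \, \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)} = \pi_\theta(s, a) \, \nabla_\theta \log \pi_\theta(s, a)$$

Here, $\nabla_\theta \log \pi_\theta(s, a)$ is the score function, which we will refer to again later.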
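As a quick numerical sanity check of the likelihood-ratio form, the following minimal sketch compares a finite-difference gradient of the policy with the policy multiplied by its score function. It assumes a softmax policy over three actions in a single state, with one logit per action as the parameter vector. That policy, and the names `softmax_policy`, `score_function`, and `policy_gradient_fd`, are illustrative choices rather than anything defined in the text.

```python
import math

def softmax_policy(theta):
    """Action probabilities for a softmax policy with one logit per action (assumed form)."""
    m = max(theta)  # subtract the max logit for numerical stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def score_function(theta, action):
    """Analytic gradient of log pi(action) w.r.t. theta: one_hot(action) - pi."""
    pi = softmax_policy(theta)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]

def policy_gradient_fd(theta, action, eps=1e-6):
    """Central finite-difference estimate of the gradient of pi(action) w.r.t. theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        down = list(theta)
        down[i] -= eps
        diff = softmax_policy(up)[action] - softmax_policy(down)[action]
        grad.append(diff / (2 * eps))
    return grad

theta = [0.5, -1.0, 2.0]  # hypothetical logits
action = 2

pi = softmax_policy(theta)
score = score_function(theta, action)

# Left-hand side: gradient of the policy itself, estimated numerically.
lhs = policy_gradient_fd(theta, action)
# Right-hand side: policy times score function (the likelihood-ratio form).
rhs = [pi[action] * s for s in score]

max_gap = max(abs(l - r) for l, r in zip(lhs, rhs))
print("largest difference between the two sides:", max_gap)
```

The two sides should agree up to finite-difference error. For this softmax parameterization the score function is simply the one-hot vector of the chosen action minus the action probabilities, which is why its components sum to zero.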
Now, let's consider a simple one-step ...