October 2019
Intermediate to advanced
366 pages
12h 4m
English
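The objective of RL is to maximize the expected return (the total reward, discounted or undiscounted) of a trajectory. The objective function can then be expressed as:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]$$

where θ denotes the parameters of the policy π_θ, such as the trainable variables of a deep neural network, τ is a trajectory generated by following that policy, and R(τ) is the return of the trajectory.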
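To make the idea of an expected return concrete, here is a minimal sketch in plain Python of how the objective, written here as J(θ), could be estimated by averaging the returns of sampled trajectories. The names `discounted_return`, `estimate_objective`, and `sample_trajectory` are hypothetical helpers introduced for illustration, and the discount factor `gamma` is an assumed parameter; setting it to 1.0 gives the undiscounted case.

```python
def discounted_return(rewards, gamma=0.99):
    """Return of one trajectory: the sum over t of gamma**t * rewards[t]."""
    total = 0.0
    # Accumulate from the last reward backward so each step is one multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def estimate_objective(sample_trajectory, num_trajectories=1000, gamma=0.99):
    """Monte Carlo estimate of the expected return.

    sample_trajectory is a hypothetical callable that runs the current policy
    for one episode and returns the list of rewards it collected.
    """
    returns = [
        discounted_return(sample_trajectory(), gamma)
        for _ in range(num_trajectories)
    ]
    return sum(returns) / len(returns)
```

For example, `discounted_return([1.0, 1.0, 1.0], gamma=0.5)` evaluates 1 + 0.5 + 0.25 and gives 1.75.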
In PG methods, the objective function J(θ) is maximized using its gradient with respect to the policy parameters, ∇_θ J(θ). Using gradient ascent, we can improve J(θ) by moving the parameters in the direction of the gradient, as the gradient points in the direction ...
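The gradient-ascent update described above can be sketched on a toy problem. This is an illustration under stated assumptions, not the policy gradient estimator itself: the objective here is a simple concave stand-in, J(θ) = -(θ - 3)², whose gradient is known in closed form, and the names `gradient_ascent`, `grad_fn`, and `learning_rate` are hypothetical.

```python
def gradient_ascent(grad_fn, theta, learning_rate=0.1, num_steps=100):
    """Repeatedly move theta a small step in the direction of the gradient."""
    for _ in range(num_steps):
        # Ascent adds the gradient (descent would subtract it).
        theta = theta + learning_rate * grad_fn(theta)
    return theta


def toy_gradient(theta):
    """Gradient of the stand-in objective J(theta) = -(theta - 3)**2."""
    return -2.0 * (theta - 3.0)


# Starting from theta = 0, the iterates approach the maximizer theta = 3.
theta_star = gradient_ascent(toy_gradient, theta=0.0)
```

In an actual PG method the gradient is not available in closed form and has to be estimated from sampled trajectories, but the parameter update keeps the same shape: the new θ is the old θ plus a step size times the gradient estimate.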