Trust region methods
A further optimization we can apply to gradient ascent is the use of trust regions, or controlled regions of updates. As the name suggests, these methods are fundamental to TRPO (trust region policy optimization), but the concept extends to other policy-based methods as well. In TRPO, we establish regions of trust over the approximation functions using the Minorize-Maximization (MM) algorithm. The intuition behind MM is that we can construct a lower bound function that the expected return never falls below. Hence, if we repeatedly maximize this lower bound, each update cannot make the policy worse, and we move toward our best policy. Gradient ascent by default is a line search algorithm: it picks a direction first and then a step size, which again introduces the problem of overshooting. Instead, we can first approximate the step size and ...
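To make the overshooting problem concrete, here is a minimal sketch in plain Python. It is not TRPO itself, which constrains the KL divergence between the old and new policies rather than a raw step length. The toy objective, the learning rate, the radius, and the function names are all assumptions chosen for illustration. The sketch compares ordinary gradient ascent with a variant that caps each update to a trust radius and only accepts a step if the objective does not get worse.

```python
def objective(x):
    # Toy objective (an assumption for illustration): maximum at x = 0.
    return -0.25 * x ** 4


def gradient(x):
    # Derivative of the toy objective.
    return -x ** 3


def plain_ascent(x, lr, steps):
    # Ordinary gradient ascent: the step length is whatever lr * gradient gives.
    for _ in range(steps):
        x = x + lr * gradient(x)
    return x


def trust_region_ascent(x, lr, radius, steps):
    # Simplified trust-region-style ascent: the step is capped to the radius,
    # and a candidate is accepted only if the objective does not decrease.
    for _ in range(steps):
        step = lr * gradient(x)
        step = max(-radius, min(radius, step))  # stay inside the trusted region
        candidate = x + step
        if objective(candidate) >= objective(x):
            x = candidate
        else:
            radius *= 0.5  # shrink the region when the step made things worse
    return x


if __name__ == "__main__":
    start = 3.0
    print("plain ascent:       ", plain_ascent(start, lr=0.5, steps=4))
    print("trust region ascent:", trust_region_ascent(start, lr=0.5, radius=0.5, steps=50))
```

With these assumed settings, the plain version jumps far past the maximum on its first update and moves further away on each subsequent one, while the capped version approaches the maximum in small, bounded steps. The acceptance test mirrors the MM idea in spirit: an update is kept only when it does not reduce the objective.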