October 2019
Intermediate to advanced
366 pages
12h 4m
English
From a broad perspective, TRPO can be seen as a continuation of the NPG algorithm for nonlinear function approximation. The biggest improvement introduced in TRPO is a constraint on the KL divergence between the new and the old policy, which defines a trust region. This allows the policy network to take larger steps while always staying within the trust region. The resulting constrained optimization problem is formulated as follows:
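$$\underset{\theta}{\text{maximize}} \; L_{\theta_{old}}(\theta) \quad \text{subject to} \quad D_{KL}(\theta_{old}, \theta) \le \delta \qquad (7.2)$$
Here, $L_{\theta_{old}}(\theta)$ is the objective surrogate function that we'll see soon, $D_{KL}(\theta_{old}, \theta)$ is the KL divergence between the old ...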
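To make the trust-region idea concrete, here is a minimal NumPy sketch of the two quantities involved: a surrogate objective and the mean KL divergence between the old and the new policy. It assumes discrete actions with strictly positive probabilities and uses the standard importance-weighted form of the surrogate from the TRPO literature. The function names, the toy data, and the threshold `delta=0.01` are illustrative assumptions, not taken from the book.

```python
import numpy as np


def surrogate_objective(new_probs, old_probs, actions, advantages):
    """Importance-weighted advantage, averaged over sampled (state, action) pairs.

    Assumed form: mean of pi_new(a|s) / pi_old(a|s) * A(s, a).
    """
    idx = np.arange(len(actions))
    ratio = new_probs[idx, actions] / old_probs[idx, actions]
    return float(np.mean(ratio * advantages))


def mean_kl(old_probs, new_probs):
    """Mean over states of KL(old || new) for categorical action distributions.

    Assumes every probability is strictly positive, so the logs are finite.
    """
    kl_per_state = np.sum(old_probs * (np.log(old_probs) - np.log(new_probs)), axis=1)
    return float(np.mean(kl_per_state))


def within_trust_region(old_probs, new_probs, delta=0.01):
    """True if the candidate policy satisfies the KL constraint (delta is hypothetical)."""
    return mean_kl(old_probs, new_probs) <= delta


# Toy batch: two states, two actions (made-up numbers for illustration only).
old_probs = np.array([[0.5, 0.5], [0.7, 0.3]])
small_step = np.array([[0.52, 0.48], [0.68, 0.32]])  # close to the old policy
large_step = np.array([[0.9, 0.1], [0.2, 0.8]])      # far from the old policy
actions = np.array([0, 1])
advantages = np.array([1.0, -0.5])

for name, candidate in [("small_step", small_step), ("large_step", large_step)]:
    print(
        name,
        "surrogate:", surrogate_objective(candidate, old_probs, actions, advantages),
        "mean KL:", mean_kl(old_probs, candidate),
        "accepted:", within_trust_region(old_probs, candidate),
    )
```

In this toy batch the candidate that stays close to the old policy passes the check, while the one that moves far away is rejected. A full TRPO update would search for the step that maximizes the surrogate subject to this check rather than testing fixed candidates.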