TRPO and PPO Implementation
In the previous chapter, we looked at policy gradient algorithms. Their uniqueness lies in the order in which they solve a reinforcement learning (RL) problem—policy gradient algorithms take a step in the direction of the highest gain of the reward. The simpler version of this algorithm (REINFORCE) has a straightforward implementation that alone achieves good results. Nevertheless, it is slow and has a high variance. For this reason, we introduced a value function that has a double goal—to critique the actor and to provide a baseline. Despite their great potential, these actor-critic algorithms can suffer from unwanted rapid variations in the action distribution that may cause a drastic change in the states that ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access