Now that we have the basic ingredients of PPO, we can implement it using Python and TensorFlow.
The structure and implementation of PPO are very similar to those of the actor-critic algorithms, with only a few additional parts, all of which we'll explain here.
One such addition is the generalized advantage estimation (7.11), which takes just a few lines of code using the already implemented discounted_rewards function that computes (7.12):
def GAE(rews, v, v_last, gamma=0.99, lam=0.95):
    # TD residuals: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)
    vs = np.append(v, v_last)
    delta = np.array(rews) + gamma*vs[1:] - vs[:-1]
    # Discounting the deltas with factor gamma*lambda yields the GAE advantages
    gae_advantage = discounted_rewards(delta, 0, gamma*lam)
    return gae_advantage
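For reference, here is a minimal sketch of what discounted_rewards could look like, assuming it computes a reverse discounted cumulative sum bootstrapped with a final value, as implied by (7.12); the exact implementation in the rest of the code may differ:

import numpy as np

def discounted_rewards(rews, last_sv, gamma):
    # Reverse cumulative sum: rtg[t] = rews[t] + gamma * rtg[t+1],
    # bootstrapped with last_sv after the final step
    rtg = np.zeros_like(rews, dtype=np.float32)
    rtg[-1] = rews[-1] + gamma * last_sv
    for i in reversed(range(len(rews) - 1)):
        rtg[i] = rews[i] + gamma * rtg[i + 1]
    return rtg

Called as discounted_rewards(delta, 0, gamma*lam) inside GAE, this sums the TD residuals with discount factor gamma*lambda, which is exactly the GAE formula (7.11).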
The GAE function is used in the store method of the Buffer class when a trajectory is stored: ...
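The store implementation itself is elided here, but as a rough illustration, the following is a minimal sketch of how a Buffer.store method could apply GAE when a completed trajectory is handed over. The trajectory layout (observation, reward, action, state value per step) and the attribute names are assumptions, not the exact code; it relies on the GAE and discounted_rewards functions defined above:

class Buffer:
    # Illustrative buffer that accumulates completed trajectories
    def __init__(self, gamma=0.99, lam=0.95):
        self.gamma = gamma
        self.lam = lam
        self.obs, self.act, self.adv, self.rtg = [], [], [], []

    def store(self, temp_traj, last_sv):
        # temp_traj rows: (observation, reward, action, state value) -- assumed layout
        if len(temp_traj) > 0:
            rews = temp_traj[:, 1]
            # GAE advantages for the policy and reward-to-go targets for the critic
            self.adv.extend(GAE(rews, temp_traj[:, 3], last_sv, self.gamma, self.lam))
            self.rtg.extend(discounted_rewards(rews, last_sv, self.gamma))
            self.obs.extend(temp_traj[:, 0])
            self.act.extend(temp_traj[:, 2])

The key point is that the advantages and the critic targets are computed once per trajectory at storage time, using the bootstrap value last_sv of the state in which the trajectory was cut off (or 0 if it ended in a terminal state).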