Implementation of PPO

Now that we have the basic ingredients of PPO, we can implement it using Python and TensorFlow.

The structure and implementation of PPO are very similar to those of the actor-critic algorithms, with only a few additional parts, all of which we'll explain here.

One such addition is generalized advantage estimation (7.11). It takes just a few lines of code because it reuses the already implemented discounted_rewards function, which computes (7.12):

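def GAE(rews, v, v_last, gamma=0.99, lam=0.95):
    # Append the bootstrap value so that vs[t+1] exists for the last step
    vs = np.append(v, v_last)
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    delta = np.array(rews) + gamma*vs[1:] - vs[:-1]
    # Discounted sum of the residuals, using gamma*lam as the discount factor
    gae_advantage = discounted_rewards(delta, 0, gamma*lam)
    return gae_advantage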
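To try the function in isolation, the following is a minimal, self-contained sketch. The discounted_rewards helper shown here is an assumption: it is a stand-in with the same call signature used above (rewards, a bootstrap value for the last state, and a discount factor), and the version implemented earlier in the book may differ in its details. The input values are made up for illustration.

```python
import numpy as np

def discounted_rewards(rews, last_sv, gamma):
    # Assumed stand-in for the helper referenced in the text:
    # rtg[t] = rews[t] + gamma * rtg[t+1], bootstrapped with last_sv
    rtg = np.zeros(len(rews), dtype=np.float64)
    rtg[-1] = rews[-1] + gamma * last_sv
    for i in reversed(range(len(rews) - 1)):
        rtg[i] = rews[i] + gamma * rtg[i + 1]
    return rtg

def GAE(rews, v, v_last, gamma=0.99, lam=0.95):
    # Same logic as the function above, repeated so the sketch runs on its own
    vs = np.append(v, v_last)
    delta = np.array(rews) + gamma * vs[1:] - vs[:-1]
    return discounted_rewards(delta, 0, gamma * lam)

# Illustrative inputs: three steps with reward 1 and a constant value estimate
rews = [1.0, 1.0, 1.0]
v = [0.5, 0.5, 0.5]

# With gamma = lam = 1 the estimate reduces to (undiscounted return - value)
adv_full = GAE(rews, v, v_last=0.0, gamma=1.0, lam=1.0)  # [2.5, 1.5, 0.5]

# With lam = 0 the estimate reduces to the one-step TD residuals
adv_td = GAE(rews, v, v_last=0.0, gamma=1.0, lam=0.0)    # [1.0, 1.0, 0.5]
```

The two calls show the extremes that lam interpolates between: lam=1 gives the high-variance Monte Carlo style advantage, while lam=0 gives the lower-variance one-step TD residual.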

The GAE function is used in the store method of the Buffer class when a trajectory is stored: ...
