Before we get into the finer details of how PPO works, we need to step back and understand how we measure the difference between two probability distributions. Remember that PG methods sample returns to estimate a distribution over actions and then use that distribution to find the optimum action, or the probability of the optimum action. Because of this, we can use a measure called KL divergence to determine how different two distributions are. Knowing how far apart the distributions are tells us how large a region of trust we can allow our optimization algorithm to explore within. PPO improves on this by clipping the objective function, using two policy networks.
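To make this concrete, the following is a minimal sketch of measuring the KL divergence between an old and a new action distribution. The use of PyTorch and the example probabilities are assumptions for illustration, not code from this book:

import torch
from torch.distributions import Categorical, kl_divergence

# Action probabilities from the old (pre-update) and new (post-update)
# policies for a single state with four discrete actions (illustrative values).
old_probs = torch.tensor([0.25, 0.25, 0.25, 0.25])
new_probs = torch.tensor([0.40, 0.30, 0.20, 0.10])

old_dist = Categorical(probs=old_probs)
new_dist = Categorical(probs=new_probs)

# KL(old || new): how far the new policy has drifted from the old one.
# A small value means the update stayed within a tight region of trust.
kl = kl_divergence(old_dist, new_dist)
print(f"KL divergence: {kl.item():.4f}")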
PPO and clipped objectives
Jonathan Hui has a number of insightful blog posts ...
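As a rough illustration of the clipped objective itself, the sketch below computes the PPO surrogate loss from probability ratios and advantages. The function name, tensor values, the clip value of 0.2, and the use of PyTorch are assumptions made for this example, not this book's implementation:

import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio between the new and old policies for each action taken.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Unclipped surrogate objective.
    surrogate = ratio * advantages
    # Clipped surrogate: the ratio cannot move outside [1 - eps, 1 + eps].
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) of the two and negate, since optimizers minimize.
    return -torch.min(surrogate, clipped).mean()

# Example with made-up values for a batch of three actions.
new_lp = torch.tensor([-0.9, -1.2, -0.4])
old_lp = torch.tensor([-1.0, -1.0, -1.0])
adv = torch.tensor([1.5, -0.5, 0.8])
print(ppo_clipped_loss(new_lp, old_lp, adv))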