Learn Unity ML-Agents - Fundamentals of Unity Machine Learning by Micheal Lanham

Asynchronous actor-critic training

Thus far, we have assumed that the internal training structure of PPO mirrors what we learned when we first looked at neural networks and DQN. However, this isn't actually the case. Instead of using a single network to derive Q-values or a policy, the PPO algorithm uses a technique called actor-critic. This method essentially combines value estimation with policy learning. In actor-critic (the asynchronous variant is known as A3C), we train two networks: one network, the critic, provides a Q-value estimate, while the other, the actor, determines the policy, that is, the actions the agent takes.
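
To make the two-network structure concrete, here is a minimal sketch using PyTorch for brevity. This is illustrative only, not the actual PPO trainer code; the layer sizes, and the obs_size and n_actions values, are invented for the example:

    import torch
    import torch.nn as nn

    obs_size, n_actions = 8, 4   # hypothetical observation and action sizes

    # Critic network: maps a state observation to a scalar value estimate
    critic = nn.Sequential(
        nn.Linear(obs_size, 128), nn.ReLU(),
        nn.Linear(128, 1),
    )

    # Actor network: maps the same observation to action probabilities (the policy)
    actor = nn.Sequential(
        nn.Linear(obs_size, 128), nn.ReLU(),
        nn.Linear(128, n_actions),
        nn.Softmax(dim=-1),
    )

    obs = torch.randn(1, obs_size)    # a dummy observation
    action_probs = actor(obs)         # the actor's policy output
    state_value = critic(obs)         # the critic's value estimate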

We compare these values in the following equation to determine the advantage:

Advantage = Q(s, a) - V(s)

However, the network is no longer calculating Q-values, so we substitute that ...
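
To give a concrete sense of the advantage calculation, the sketch below substitutes the discounted return collected from the environment for the Q-value, as actor-critic methods such as A3C commonly do. The episode rewards and value estimates here are invented for the example:

    import numpy as np

    def discounted_returns(rewards, gamma=0.99):
        # Accumulate discounted reward backwards through the episode
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    rewards = np.array([0.0, 0.0, 1.0])   # invented episode rewards
    values = np.array([0.5, 0.7, 0.9])    # the critic's value estimates V(s)

    # Advantage: how much better the outcome was than the critic predicted
    advantage = discounted_returns(rewards) - values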
