October 2019
Intermediate to advanced
366 pages
The performance graph is shown in the following diagram:

The reward is plotted as a function of the number of interactions with the real environment. After 900 steps, or about 15 games, the agent reaches the top performance of 1,000. Over that period, the policy was updated 15 times and learned from 750,000 simulated steps. Computationally, training took about 2 hours on a mid-range computer.
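To put these figures in perspective, the short sketch below works out the averages they imply. Only the three input numbers come from the text; the variable names are illustrative, and the per-update values assume the steps were spread evenly across the 15 updates, which the text does not state.

```python
# Figures quoted in the text.
real_steps = 900            # interactions with the real environment
policy_updates = 15         # number of policy updates
simulated_steps = 750_000   # steps generated by the learned model

# Derived averages (assumption: steps are split evenly across updates).
simulated_per_update = simulated_steps / policy_updates
real_per_update = real_steps / policy_updates
simulated_per_real = simulated_steps / real_steps

print(f"simulated steps per policy update: {simulated_per_update:,.0f}")
print(f"real steps per policy update: {real_per_update:,.0f}")
print(f"simulated steps per real step: {simulated_per_real:,.1f}")
```

On average, each real interaction is backed by several hundred simulated ones, which is where the sample efficiency with respect to the real environment comes from.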
We noted that the results vary considerably: training with different random seeds can produce very different performance curves. This is also true for model-free algorithms, but ...
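One common way to make seed-to-seed variability visible is to repeat the run with several seeds and report the mean and standard deviation of the reward at each evaluation point. The following is a minimal sketch of that idea, not the book's code: `run_experiment` is a hypothetical placeholder that returns a synthetic curve, and it would need to be replaced by the actual training routine.

```python
import random
import statistics


def run_experiment(seed, num_points=5):
    """Hypothetical stand-in for a full training run.

    Returns a synthetic reward curve (one value per evaluation point).
    Replace the body with real training code.
    """
    rng = random.Random(seed)
    curve = []
    for i in range(num_points):
        noisy_reward = 200.0 * (i + 1) + rng.gauss(0.0, 150.0)
        # Clip to the [0, 1000] reward range used in this illustration.
        curve.append(min(1000.0, max(0.0, noisy_reward)))
    return curve


seeds = [0, 1, 2, 3, 4]
curves = [run_experiment(s) for s in seeds]

# zip(*curves) groups the rewards of all seeds at the same evaluation point.
mean_curve = [statistics.mean(point) for point in zip(*curves)]
std_curve = [statistics.stdev(point) for point in zip(*curves)]

for i, (m, s) in enumerate(zip(mean_curve, std_curve)):
    print(f"point {i}: mean reward {m:.1f} +/- {s:.1f}")
```

Reporting the spread alongside the mean makes it clearer whether a single curve is representative or just a lucky seed.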