Results
Evaluating the progress of an RL algorithm is challenging. The most obvious approach is to track its end goal, that is, to monitor the total reward accumulated during the training epochs. This is a good metric, but the average reward measured during training can be very noisy: small changes in the weights can lead to large changes in the distribution of the states being visited.
For these reasons, we evaluated the algorithm on 10 test games every 20 training epochs and kept track of the average total (non-discounted) reward accumulated over those games. Moreover, because the environment is deterministic, we tested the agent with an ε-greedy policy (with ε = …) so that the evaluation is more robust. ...
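The evaluation procedure described above can be sketched as follows. This is a minimal illustration, not the book's implementation: the names `evaluate`, `epsilon_greedy_action`, and `CountdownEnv` are hypothetical, the environment is assumed to expose `reset()` returning a state and `step(action)` returning `(state, reward, done)`, and `q_function` is assumed to map a state to a list of action values. The value of ε is left as a required argument because it is not given here.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def evaluate(env, q_function, num_games, epsilon, seed=0):
    """Mean total (non-discounted) reward over num_games test episodes."""
    rng = random.Random(seed)
    totals = []
    for _ in range(num_games):
        state = env.reset()
        done = False
        total = 0.0
        while not done:
            action = epsilon_greedy_action(q_function(state), epsilon, rng)
            state, reward, done = env.step(action)
            total += reward  # plain sum: no discount factor is applied
        totals.append(total)
    return sum(totals) / len(totals)


class CountdownEnv:
    """Hypothetical toy environment: fixed horizon, action 1 earns reward 1."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        return self.t, reward, self.t >= self.horizon
```

In a training loop, a check such as `if epoch % 20 == 0:` could call `evaluate(test_env, q_function, num_games=10, epsilon=...)` and log the returned average. Keeping a separate random generator for evaluation means the test games do not disturb the exploration randomness used during training.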