October 2019
Intermediate to advanced
366 pages
12h 4m
English
As we already saw with TD learning, a fully online algorithm has low variance but high bias, the opposite of MC learning. In practice, a middle ground between fully online and MC methods is usually preferred. To balance this trade-off, the one-step return of online algorithms can be replaced with an n-step return.
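To make the trade-off concrete, here is a minimal sketch of an n-step return: the discounted sum of the next n rewards plus the discounted value estimate of the state reached after them. The function name `n_step_return`, the list layout, and the numbers are illustrative assumptions, not code from the book.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Return the n-step return from time step t (illustrative helper).

    Assumed layout: rewards[k] is the reward received after acting in
    state k, and values[k] is the critic's estimate V(s_k), so that
    len(values) == len(rewards) + 1 (the last entry is 0 if terminal).
    """
    # Stop early if the episode ends before n steps have elapsed.
    end = min(t + n, len(rewards))
    # Bootstrap from the value estimate of the state reached after the rewards.
    g = values[end]
    # Fold the rewards in backwards: g = r_k + gamma * g.
    for k in reversed(range(t, end)):
        g = rewards[k] + gamma * g
    return g


# Hypothetical 4-step episode that ends in a terminal state (value 0).
rewards = [1.0, 1.0, 1.0, 1.0]
values = [0.0, 2.0, 4.0, 8.0, 0.0]
gamma = 0.5

one_step = n_step_return(rewards, values, t=0, n=1, gamma=gamma)   # TD(0) target
two_step = n_step_return(rewards, values, t=0, n=2, gamma=gamma)
full_mc = n_step_return(rewards, values, t=0, n=4, gamma=gamma)    # MC return
```

With n = 1 this reduces to the one-step TD target, and with n covering the whole episode it reduces to the MC return, so n controls how much the target relies on the bootstrapped estimate versus the sampled rewards.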
If you remember, we already implemented n-step learning in the DQN algorithm. The only difference is that DQN is an off-policy algorithm and, in theory, n-step returns can be employed only in on-policy algorithms. Nevertheless, we showed that with a small n, the performance increased.
AC algorithms are on-policy, ...