N-step DQN
The idea behind n-step DQN is old and comes from the relationship between temporal difference (TD) learning and Monte Carlo (MC) learning. These algorithms, which were introduced in Chapter 4, Q-Learning and SARSA Applications, sit at opposite extremes of a common spectrum. TD learning learns from a single step, while MC learns from the complete trajectory. TD learning exhibits minimal variance but maximal bias, whereas MC exhibits high variance but minimal bias. This bias-variance trade-off can be balanced using an n-step return. An n-step return accumulates the discounted rewards of n consecutive steps and then estimates the remainder of the return by bootstrapping from the current value estimate. TD learning can be viewed as a one-step return, while MC can be viewed as an ∞-step return.
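To make the n-step return concrete, here is a minimal Python sketch. The function name `n_step_return` and its parameters (`rewards`, `bootstrap_value`, `gamma`, `n`) are hypothetical names chosen for illustration, not taken from the book's code. It assumes that the caller passes the value estimate of the state reached after n steps, or 0.0 if the episode ended before that.

```python
from typing import Sequence


def n_step_return(rewards: Sequence[float], bootstrap_value: float,
                  gamma: float = 0.99, n: int = 3) -> float:
    """Discounted sum of up to n rewards plus a discounted bootstrap value.

    Illustrative helper (hypothetical name and signature).
    rewards[k] is the reward collected k steps after the current state.
    bootstrap_value is the value estimate of the state reached after the
    last step used; pass 0.0 if that state is terminal.
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    # Use fewer than n rewards if the trajectory is shorter than n steps.
    steps = min(n, len(rewards))

    g = 0.0
    for k in range(steps):
        g += (gamma ** k) * rewards[k]

    # Bootstrap from the value estimate, discounted by the steps taken.
    return g + (gamma ** steps) * bootstrap_value


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    # n = 1 corresponds to the one-step TD target.
    print(n_step_return(rewards, bootstrap_value=10.0, gamma=0.5, n=1))
    # Larger n relies more on observed rewards and less on the estimate.
    print(n_step_return(rewards, bootstrap_value=10.0, gamma=0.5, n=3))
```

With n = 1 the function reduces to the usual one-step TD target. As n grows toward the episode length and the bootstrap value is 0.0, it approaches the Monte Carlo return.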
With the n-step return, we can update the target value, as follows:
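In the standard formulation, the n-step DQN target is commonly written as shown here. The notation is an assumption: \gamma is the discount factor, r_{t+k} is the reward collected k steps after state s_t, and Q_{\theta'} is the target network.

\[
y_t = \sum_{k=0}^{n-1} \gamma^{k} \, r_{t+k} + \gamma^{n} \max_{a'} Q_{\theta'}(s_{t+n}, a')
\]

Setting n = 1 recovers the usual one-step DQN target, r_t + \gamma \max_{a'} Q_{\theta'}(s_{t+1}, a'). If the episode terminates before n steps have elapsed, the bootstrap term is dropped and the sum stops at the last reward.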