One issue with value-function approximation in Q-learning is that we use the same network to compute both the estimate at time t and the TD target value, which is based on the estimate at time t+1 (preceding equation). Let's say that we update the network weights at step t with the TD target at t+1. In the next iteration, we'll compute the next TD target, at step t+2, using the updated network. As a result, there is a strong correlation between the TD target and the network weights: when the weights change, so does the TD target. Think of it as a moving goalpost – as the network tries to get closer to the TD target, the target shifts and moves further away. This can lead to oscillations and unstable training.
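
To make the moving-target effect concrete, here is a minimal PyTorch-style sketch (not this book's code; the network shape, transition values, and learning rate are made up for illustration). It computes the TD target with the same network that produces the estimate, performs one gradient step towards that target, and then recomputes the target to show that it has shifted along with the weights:

```python
import torch
import torch.nn as nn

# Small Q-network (hypothetical architecture, 4-dim state, 2 actions)
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=0.1)
gamma = 0.99

state = torch.randn(1, 4)        # s_t (made-up transition)
next_state = torch.randn(1, 4)   # s_{t+1}
action, reward = 0, 1.0

# TD target computed with the *same* network that we are about to update
with torch.no_grad():
    target_before = reward + gamma * q_net(next_state).max()

# One gradient step that pulls the estimate towards that target
estimate = q_net(state)[0, action]
loss = (estimate - target_before) ** 2
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Recompute the target after the update: it has moved,
# because the weights that define it have just changed
with torch.no_grad():
    target_after = reward + gamma * q_net(next_state).max()

print(f"TD target before update: {target_before.item():.4f}")
print(f"TD target after update:  {target_after.item():.4f}")
```

Running this repeatedly shows the target value drifting after every weight update, which is exactly the correlation between the TD target and the network weights described above.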