One of the issues with deep Q learning is that we use the same network weights W to estimate both the target and the predicted Q value. As a result, the predicted Q values and the target Q values are strongly correlated, since both depend on the same constantly changing weights. Both shift at every step of training, which leads to oscillations.
To stabilize this, we use a copy of the original network, the target network, to estimate the target Q values, and the weights of the target network are copied from the original network at fixed intervals during training. This variant of the deep Q learning network is called double deep Q learning and generally leads to more stable training. The working mechanics ...
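As a rough illustration of this idea, here is a minimal sketch in PyTorch. The network architecture, the `QNetwork` name, the batch variable names, and the sync interval `SYNC_EVERY` are all hypothetical choices for the example, not part of the original text; the point is only that target Q values come from a frozen copy whose weights are refreshed every few steps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Small fully connected Q network (hypothetical architecture)."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

state_dim, n_actions, gamma = 4, 2, 0.99
online_net = QNetwork(state_dim, n_actions)
target_net = QNetwork(state_dim, n_actions)
target_net.load_state_dict(online_net.state_dict())  # start as an exact copy
optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-3)

def train_step(states, actions, rewards, next_states, dones):
    """One gradient step on a sampled batch of transitions."""
    # Q(s, a) predicted by the online network for the actions actually taken
    q_pred = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target Q values come from the frozen target network, so they do not
    # shift with every update of the online weights
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * next_q * (1 - dones)

    loss = F.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every SYNC_EVERY training steps, copy the online weights into the target network
SYNC_EVERY = 1000
def maybe_sync(step):
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(online_net.state_dict())
```

Because the target network is held fixed between syncs, the regression target stays stationary for a while, which is what dampens the oscillations described above.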