Q-learning has been shown to overestimate action values because its update target takes the maximum over estimated action values, so estimation noise translates into an upward bias. This bias can negatively affect the learning process and the resulting policy if it is not uniform across states and actions and thus alters action preferences, as shown by Hado van Hasselt and his co-authors in Deep Reinforcement Learning with Double Q-learning (2015: https://arxiv.org/abs/1509.06461).
To decouple the estimation of action values from the selection of actions, Double Deep Q-Learning (DDQN) uses the weights, θ, of one network to select the best action given the next state, and the weights, θ', of another network to provide the corresponding action value estimate, that is:
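$$
Y_t^{\text{DDQN}} = R_{t+1} + \gamma\, Q\!\left(S_{t+1}, \operatorname*{argmax}_a Q(S_{t+1}, a; \theta_t);\, \theta'_t\right)
$$

Here, the online weights θ select the greedy action for the next state, while the second set of weights θ' evaluates it; this is the Double DQN target as given in the van Hasselt et al. paper.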
One option is to randomly select one of two identical ...
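To make the selection/evaluation split concrete, the following is a minimal NumPy sketch of the target computation; the function name ddqn_targets and its arguments (batched rewards, each network's Q-value estimates for the next states, and terminal flags) are illustrative assumptions rather than the book's own code:

```python
import numpy as np

def ddqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Compute Double DQN targets for a batch of transitions.

    next_q_online : Q(s', .; theta)  -- online-network estimates, shape (batch, n_actions)
    next_q_target : Q(s', .; theta') -- second network's estimates, same shape
    """
    # Select the greedy next action with the online weights, theta ...
    best_actions = next_q_online.argmax(axis=1)
    # ... but evaluate that action with the other network's weights, theta'
    next_values = next_q_target[np.arange(len(best_actions)), best_actions]
    # One-step TD target; terminal transitions do not bootstrap
    return rewards + gamma * (1.0 - dones) * next_values

# Illustrative call with random stand-ins for the two networks' outputs
rng = np.random.default_rng(0)
batch, n_actions = 4, 3
targets = ddqn_targets(rewards=rng.normal(size=batch),
                       next_q_online=rng.normal(size=(batch, n_actions)),
                       next_q_target=rng.normal(size=(batch, n_actions)),
                       dones=np.zeros(batch))
```

Because each action is chosen with one set of weights but valued with the other, an action that one network happens to overestimate is not automatically assigned that inflated value, which is what counteracts the upward bias.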