Based on the equation that expresses the Q-value for a state-action pair (s_t, a_t) in terms of the current reward r_t and the discounted maximum Q-value over the next state-action pair (s_{t+1}, a_{t+1}), the logical strategy would be to train the network on each transition (s, a, r, s') as soon as it is observed, using the current reward plus the discounted maximum predicted Q-value for the next state s' as the target. It turns out that this tends to drive the network into a local minimum. The reason is that consecutive training samples are very similar to one another.
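Written out, the relation referred to above is the standard Q-learning target, with \gamma denoting the discount factor:

    Q(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})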
To counter this, during game play we collect all of the previous moves (s, a, r, s') into a large fixed-size queue called the replay memory. The replay memory represents the experience of the network. When training the network, we generate random batches of these stored transitions as our training data, which breaks the correlation between consecutive samples.
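As a minimal sketch of this idea in Python (an illustration, not the book's exact implementation; class and method names here are made up for the example), a replay memory can be built from a fixed-size deque, with random sampling used to produce training batches:

    import random
    from collections import deque

    class ReplayMemory:
        """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""

        def __init__(self, max_size=50000):
            # Oldest transitions are discarded automatically once the queue is full.
            self.buffer = deque(maxlen=max_size)

        def remember(self, state, action, reward, next_state, done):
            # Store one transition observed during game play.
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # Draw a random mini-batch; random sampling breaks the strong
            # correlation between consecutive game frames.
            return random.sample(self.buffer, min(batch_size, len(self.buffer)))

Each training step then builds its Q-value targets from such a randomly sampled batch rather than from the most recent transition alone.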