October 2019
Intermediate to advanced
366 pages
12h 4m
English
As Q-learning is a TD method, it needs a behavior policy that, as time passes, will converge to a deterministic policy. A good strategy is to use an ε-greedy policy with linear or exponential decay (as has been done for SARSA).
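To make the decay idea concrete, here is a minimal sketch of an ε-greedy action selector with a linear and an exponential schedule. The function names (`linear_decay`, `exponential_decay`, `eps_greedy`) and all default values are assumptions chosen for illustration, not taken from the book.

```python
import random


def linear_decay(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps."""
    # Illustrative defaults only; tune them for the task at hand.
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def exponential_decay(step, eps_start=1.0, eps_end=0.05, decay_rate=0.995):
    """Shrink epsilon geometrically with the step count, floored at eps_end."""
    return max(eps_end, eps_start * decay_rate ** step)


def eps_greedy(q_values, eps, rng=random):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    # Ties are broken in favor of the lowest action index.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

As ε approaches its floor, the behavior policy picks the greedy action more and more often, which is the sense in which it drifts toward a deterministic policy. Keeping a small positive floor is a common design choice when some residual exploration is wanted.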
To recap, the Q-learning algorithm uses the following:
An ε-greedy policy to interact with and explore the environment
After these concluding observations, we can finally come up with the following pseudocode for the Q-learning algorithm:
Initialize ...
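As a rough illustration of the algorithm being described, the following is a minimal sketch of standard tabular Q-learning. It is not a reconstruction of the book's pseudocode. The toy chain environment inside `step`, the function name `q_learning`, and every hyperparameter value are hypothetical and chosen only to keep the example self-contained.

```python
import random


def q_learning(n_states=5, n_actions=2, episodes=500, alpha=0.1, gamma=0.9,
               eps_start=1.0, eps_end=0.05, eps_decay=0.99, seed=0):
    """Tabular Q-learning on a hypothetical chain environment."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    eps = eps_start

    def step(state, action):
        # Assumed toy dynamics: action 1 moves right, action 0 moves left.
        # Reaching the last state ends the episode with reward 1.
        if action == 1:
            nxt = min(state + 1, n_states - 1)
        else:
            nxt = max(state - 1, 0)
        done = nxt == n_states - 1
        return nxt, (1.0 if done else 0.0), done

    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Behavior policy: epsilon-greedy over the current Q estimates.
            if rng.random() < eps:
                action = rng.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])
            nxt, reward, done = step(state, action)
            # Target policy: greedy, hence the max over next-state values.
            target = reward + (0.0 if done else gamma * max(Q[nxt]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = nxt
            if done:
                break
        # Exponential decay of epsilon, applied once per episode.
        eps = max(eps_end, eps * eps_decay)
    return Q
```

Calling `q_learning()` returns the learned table as a list of per-state action values. The `max` in the target is what makes the update off-policy: actions are chosen by the ε-greedy behavior policy, while the bootstrap term always evaluates the greedy one.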