Expected SARSA
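Vanilla SARSA is quite similar to Q-learning in how it updates values. Both typically select actions with an epsilon-greedy strategy, not unlike what we used previously; the difference is that Q-learning bootstraps from the maximum action value in the next state, whereas SARSA, being on-policy, bootstraps from the action it actually takes next. Because that next action is sampled, a single exploratory choice can swing the update, so when working on-policy the algorithm needs to be more selective. Now, balancing such trade-offs is very much the goal of all RL, but in this particular case we manage them a bit better by introducing an expectation: rather than using the value of the one sampled next action, we average the next-state action values, weighted by how likely the policy is to pick each one. When we combine this with SARSA, we call it expected SARSA.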
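As a rough illustration of that expectation, here is a minimal tabular sketch in Python. It assumes a NumPy Q-table indexed as Q[state, action] and an epsilon-greedy policy. The function names (epsilon_greedy_probs, expected_sarsa_update) and the fixed alpha, gamma, and epsilon values are hypothetical choices made for this example only; they are not the settings used in the text.

```python
import numpy as np


def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state."""
    n_actions = len(q_row)
    # Every action shares the exploration mass epsilon equally...
    probs = np.full(n_actions, epsilon / n_actions)
    # ...and the greedy action receives the remaining 1 - epsilon.
    probs[int(np.argmax(q_row))] += 1.0 - epsilon
    return probs


def expected_sarsa_update(Q, s, a, r, s_next, done,
                          alpha=0.1, gamma=0.99, epsilon=0.1):
    """Apply one expected SARSA update to Q in place; return the new Q[s, a].

    alpha, gamma, and epsilon are fixed here purely for illustration.
    """
    if done:
        # No bootstrapping from a terminal state.
        expected_next = 0.0
    else:
        # Expectation of the next-state action values under the policy,
        # used in place of the single sampled action that SARSA relies on.
        probs = epsilon_greedy_probs(Q[s_next], epsilon)
        expected_next = float(np.dot(probs, Q[s_next]))
    td_target = r + gamma * expected_next
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q[s, a]


# Tiny usage example: two states, two actions.
Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]
new_value = expected_sarsa_update(Q, s=0, a=0, r=1.0, s_next=1, done=False,
                                  alpha=0.5, gamma=0.9, epsilon=0.5)
```

The only change from a plain SARSA update is the expected_next term. Averaging over the policy removes the variance that comes from sampling a single next action, at the cost of one pass over the actions per update.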
In expected SARSA, we treat the learning rate alpha as unknown, and hence the exploration rate epsilon as unknown as well. Rather than fixing either value in advance, we set both alpha and epsilon using functions based on assigned rewards. We assign ...