January 2019
Intermediate to advanced
386 pages
11h 13m
English
Sarsa is an on-policy TD control method. Much like MC control, we'll estimate the action-value function in order to find the optimal policy, for the same reasons we outlined in the Exploring starts policy improvement section. This time, however, we'll follow the blueprint outlined in the preceding section. That is, we'll iterate over multiple episodes and update the action-value estimates online, after each step of an episode. We can represent this process with a formula similar to the one in the preceding section, except that it applies to the action-value function:
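$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$

where $Q(s_{terminal}, a) = 0$ for each action $a$ of the terminal state. ...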
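To make the online update concrete, here is a minimal tabular sketch in Python. It is an illustration under stated assumptions, not the book's implementation: the five-state chain environment (`chain_step`), the epsilon-greedy helper, and the hyperparameter values are all hypothetical and chosen only to keep the example self-contained. The standard Sarsa target, reward plus the discounted value of the next state-action pair, is applied in `sarsa_update`.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = [Q[(state, a)] for a in range(n_actions)]
    best = max(values)
    # Break ties at random so an all-zero table does not always favour action 0.
    return rng.choice([a for a, v in enumerate(values) if v == best])


def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha, gamma):
    """One Sarsa step: move Q(s, a) towards r + gamma * Q(s_next, a_next)."""
    # The action value of a terminal state is taken to be 0 for every action.
    next_q = 0.0 if done else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * next_q - Q[(s, a)])


def chain_step(state, action, n_states=5):
    """Hypothetical toy environment: action 1 moves right, action 0 moves left.
    Reaching the last state ends the episode with a reward of 1."""
    if action == 1:
        next_state = min(state + 1, n_states - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == n_states - 1
    return next_state, (1.0 if done else 0.0), done


def run_sarsa(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Iterate over episodes, updating Q online after each step."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # unseen (state, action) pairs start at 0
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, 2, epsilon, rng)
        done = False
        while not done:
            s_next, r, done = chain_step(s, a)
            # On-policy: the next action comes from the same behaviour policy
            # and is the one actually taken on the following step.
            a_next = epsilon_greedy(Q, s_next, 2, epsilon, rng)
            sarsa_update(Q, s, a, r, s_next, a_next, done, alpha, gamma)
            s, a = s_next, a_next
    return Q
```

Calling `run_sarsa()` returns a table keyed by `(state, action)` pairs. Because the action used in the update target is also the action executed next, the estimates reflect the epsilon-greedy policy being followed, which is what makes the method on-policy.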