April 2018
Intermediate to advanced
334 pages
10h 18m
English
The TD(0) rule finds the value estimate if the finite data is repeated infinitely often, that is, if we take this finite data and keep running the following estimate update rule over it again and again. In effect, we are then averaging over each of the observed transitions.
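
As a rough illustration of replaying a finite batch, the sketch below applies the standard TD(0) update repeatedly to a small fixed set of transitions. The function name `batch_td0`, the `(state, reward, next_state)` tuple layout, the constant step size `alpha`, and the toy data are all assumptions made for this example, not taken from the text.

```python
from collections import defaultdict


def batch_td0(transitions, gamma=1.0, alpha=0.01, sweeps=5000):
    """Replay a fixed batch of transitions, applying the TD(0) update each time.

    transitions: list of (state, reward, next_state) tuples (hypothetical layout);
                 next_state is None when the episode terminates.
    """
    values = defaultdict(float)  # every state starts with an estimate of 0.0
    for _ in range(sweeps):
        for state, reward, next_state in transitions:
            # A terminal successor contributes no future value.
            next_value = 0.0 if next_state is None else values[next_state]
            # TD(0): move the estimate toward reward + gamma * V(next_state).
            td_error = reward + gamma * next_value - values[state]
            values[state] += alpha * td_error
    return dict(values)


# Toy batch (illustrative only): A is seen once and leads to B with reward 0;
# B terminates with reward 1 in six transitions and reward 0 in two.
batch = [("A", 0.0, "B"), ("B", 0.0, None)]
batch += [("B", 1.0, None)] * 6
batch += [("B", 0.0, None)]

values = batch_td0(batch)
print(values)  # both estimates settle close to 0.75
```

Here B's estimate approaches the average of its observed rewards (6/8), and A inherits that value through its single observed transition into B, which is the averaging over transitions described above. Because the step size is constant, the estimates hover near these values rather than matching them exactly.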

Thus, the value function is given by the following:
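
In standard notation (an assumption here, since the symbols are not defined in the surrounding text), the point that repeated TD(0) updates over a fixed batch settle on can be sketched as the value function of the model estimated from that batch. Take $\hat{P}(s' \mid s)$ to be the fraction of observed transitions out of $s$ that lead to $s'$, $\hat{r}(s, s')$ the mean reward observed on those transitions, and $\gamma$ the discount factor:

```latex
\[
V(s) \;=\; \sum_{s'} \hat{P}(s' \mid s)\,\bigl[\hat{r}(s, s') + \gamma\, V(s')\bigr]
\]
```

Each state's value is therefore an average over the transitions actually seen leaving that state, with the successor's own estimate standing in for the rest of the return.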

This is not true in the case of the outcome-based model, where we don't use an estimate of the state, that is
, rather we use the ...