As we did in Monte Carlo prediction, in TD prediction we try to predict the state values. In Monte Carlo prediction, we estimated the value function by simply taking the mean return. But in TD learning, we update the value of the previous state using the value of the current state. How can we do this? TD learning uses something called the TD update rule for updating the value of a state, as follows:
V(s) = V(s) + α (r + γ V(s') − V(s))

where V(s) is the value of the previous state, V(s') is the value of the current state, r is the reward, α is the learning rate, and γ is the discount factor.
What does this equation actually mean?
If you think of this equation intuitively, it is actually the difference between the target value we just observed, r + γ V(s'), and our previous estimate, V(s). This difference is called the TD error, and we move the old estimate toward the target by a fraction α of that error.
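To make this concrete, here is a minimal sketch of TD prediction in Python. The Gym-style environment interface (env.reset(), env.step()) and the random behavior policy are assumptions for illustration, not part of the rule itself; the essential lines are the TD target and TD error, which implement the update rule above.

```python
def td_prediction(env, num_episodes=1000, alpha=0.1, gamma=0.9):
    """Estimate state values under a random policy using the TD update rule (sketch)."""
    V = {}  # value estimates, defaulting to 0.0 for unseen states

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = env.action_space.sample()              # random policy, for illustration only
            next_state, reward, done, _ = env.step(action)  # assumes classic Gym step() signature

            # TD update rule: V(s) = V(s) + alpha * (r + gamma * V(s') - V(s))
            td_target = reward + gamma * V.get(next_state, 0.0)
            td_error = td_target - V.get(state, 0.0)
            V[state] = V.get(state, 0.0) + alpha * td_error

            state = next_state

    return V
```

Notice that each update happens at every step, as soon as the next state and reward are observed; unlike Monte Carlo prediction, we never wait for the episode to finish before improving our estimates.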