October 2019
Intermediate to advanced
366 pages
12h 4m
English
From the previous chapter, Solving Problems with Dynamic Programming we know the following:

Empirically, the Monte Carlo update estimates this value by averaging returns from multiple full trajectories. Developing the equation further, we obtain the following:

The preceding equation is approximated by the DP algorithms. The difference is that TD algorithms estimate the expected value instead of computing it. The estimate is done in the same way as Monte Carlo methods do, by averaging:
In practice, instead of calculating the average, ...
Read now
Unlock full access