October 2019
Intermediate to advanced
340 pages
8h 39m
English
In the REINFORCE algorithm, Monte Carlo plays out the whole trajectory in an episode that is used to update the policy afterward. However, the stochastic policy may take different actions at the same state in different episodes. This can confuse the training, since one sampled experience wants to increase the probability of choosing one action while another sampled experience may want to decrease it. To reduce this high variance problem in vanilla REINFORCE, we will develop a variation algorithm, REINFORCE with baseline, in this recipe.
In REINFORCE with baseline, we subtract the baseline state-value from the return, G. As a result, we use an advantage function A in the gradient update, which ...
Read now
Unlock full access