January 2019
Intermediate to advanced
386 pages
11h 13m
English
REINFORCE is a Monte Carlo policy gradient method. It is Monte Carlo in the sense that it updates the policy only after playing full environment episodes, in the same way as the Monte Carlo value-approximation methods we described in Chapter 8, Reinforcement Learning Theory. Once an episode finishes, REINFORCE updates the policy parameters θ for each step t of the episode trajectory with the following rule:

$$\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla \pi(a_t \mid s_t, \theta_t)}{\pi(a_t \mid s_t, \theta_t)}$$

Where α is the learning rate and G_t is the total discounted reward (the return) collected from time t onward. Now, let's discuss the last element of the equation. We divide (the gradient of the probability of taking action a_t, given ...
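
To make the update rule more concrete, here is a minimal NumPy sketch of one REINFORCE update over a finished episode. It is an illustration under stated assumptions, not the book's implementation: the policy is assumed to be a linear softmax, π(a|s, θ) = softmax(s·θ), and the function names (`discounted_returns`, `softmax`, `reinforce_update`) and the default values of `alpha` and `gamma` are hypothetical. The sketch relies on the standard identity ∇π/π = ∇ log π, so the fraction in the rule is computed as the gradient of the log-probability of the chosen action.

```python
import numpy as np


def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk the episode backwards so each return reuses the next one.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def reinforce_update(theta, states, actions, rewards, alpha=0.01, gamma=0.99):
    """Apply the REINFORCE rule once per step of a finished episode.

    Assumption: a linear softmax policy, pi(a|s) = softmax(s @ theta),
    with theta of shape (n_features, n_actions).
    """
    theta = theta.copy()  # leave the caller's parameters untouched
    returns = discounted_returns(rewards, gamma)
    for s, a, g in zip(states, actions, returns):
        probs = softmax(s @ theta)
        one_hot = np.zeros_like(probs)
        one_hot[a] = 1.0
        # For a linear softmax policy, grad(pi) / pi = grad(log pi)
        # = outer(s, one_hot(a) - probs).
        grad_log_pi = np.outer(s, one_hot - probs)
        theta += alpha * g * grad_log_pi
    return theta
```

Calling `reinforce_update` with the states, actions, and rewards recorded during one episode returns the updated parameters. Actions followed by a high return G_t have their probability increased, while the division by π (implicit in the log-gradient) keeps frequently chosen actions from dominating the update.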