The formula of PG that we've just seen is used by most policy-based methods, but the details can vary. One very important point is how exactly the gradient scales, Q(s, a), are calculated. In the cross-entropy method from Chapter 4, The Cross-Entropy Method, we played several episodes, calculated the total reward for each of them, and trained on transitions from episodes with a better-than-average reward. This training procedure is the PG method with Q(s, a) = 1 for actions from good episodes (those with a large total reward) and Q(s, a) = 0 for actions from worse episodes.
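To make this correspondence concrete, here is a minimal PyTorch-style sketch (not the book's exact code) of a PG loss scaled by Q(s, a); the names pg_loss, logits, actions, q_vals, and elite_mask are illustrative assumptions. Setting the scale to 1 for transitions from elite episodes and 0 for the rest reproduces the cross-entropy training just described.

```python
import torch
import torch.nn.functional as F

def pg_loss(logits: torch.Tensor, actions: torch.Tensor,
            q_vals: torch.Tensor) -> torch.Tensor:
    # Loss = -Q(s, a) * log pi(a|s), averaged over the batch.
    # logits: policy network output, shape (batch, n_actions)
    # actions: actions actually taken, shape (batch,)
    # q_vals: gradient scales Q(s, a), shape (batch,)
    log_probs = F.log_softmax(logits, dim=1)
    # log-probability of the action taken in each state
    log_prob_actions = log_probs[torch.arange(len(actions)), actions]
    # scale by Q(s, a); negate because optimizers minimize
    return -(q_vals * log_prob_actions).mean()

# Cross-entropy method as a special case: elite_mask is a hypothetical
# 0/1 tensor marking transitions from better-than-average episodes.
# loss = pg_loss(logits, actions, elite_mask.float())
```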
The cross-entropy method worked even with those simple assumptions, but an obvious improvement is to use the actual Q(s, a) values for training instead of just 0 and 1. So ...