Deep Reinforcement Learning Hands-On
by Oleg Vasilev, Maxim Lapan, Martijn van Otterlo, Mikhail Yurushkin, Basem O. F. Alijla
Actor-critic
The next step in reducing variance is to make our baseline state-dependent, which is intuitively a good idea, since different states can have very different baselines. To judge how suitable a particular action is in some state, we use the discounted total reward of the action. However, the total reward itself can be represented as the value of the state plus the advantage of the action: Q(s, a) = V(s) + A(s, a). We saw this in Chapter 7, DQN Extensions, when we discussed DQN modifications, particularly dueling DQN.
So, why can't we use V(s) as a baseline? In that case, the scale of our gradient will be just the advantage A(s, a), showing how much better the taken action is relative to the average value of the state. ...
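To make the idea of using V(s) as a baseline concrete, here is a minimal sketch in plain Python. It assumes that Q(s, a) is estimated by the discounted total reward observed from each step onward, and that estimates of V(s) are already available from somewhere (in practice, a learned critic). The function names and the sample numbers are hypothetical and chosen only for illustration; they are not taken from the book's code.

```python
def discounted_returns(rewards, gamma):
    """Discounted total reward from each step to the end of the episode."""
    returns = []
    running = 0.0
    # Walk backward so each step reuses the return of the step after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, state_values, gamma=0.99):
    """Estimate A(s, a) = Q(s, a) - V(s), with Q taken as the discounted return."""
    q_estimates = discounted_returns(rewards, gamma)
    return [q - v for q, v in zip(q_estimates, state_values)]


# Made-up episode: three steps with reward 1.0 each.
rewards = [1.0, 1.0, 1.0]
# Made-up V(s) estimates for the three visited states (assumption).
state_values = [2.5, 2.0, 0.5]

adv = advantages(rewards, state_values, gamma=1.0)
print(adv)  # [0.5, 0.0, 0.5]
```

With gamma set to 1.0, the returns are 3.0, 2.0, and 1.0, so subtracting the baseline leaves only how much better or worse each step turned out than the state's expected value. These differences, rather than the raw returns, would then scale the policy gradient.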