The critic
The critic outputs an estimate of the action-value function Q(s, a), which is why you might sometimes see the critic network called the action-value function approximator. Its job is to tell the actor how good a chosen action is in a given state, so that the actor can improve its policy.
The critic model works very similarly to the Q-function approximator that we saw in Chapter 10, Deep Learning for Game Playing. The critic computes a temporal-difference (TD) error, which it uses to update its weights. The TD error helps the algorithm reduce the variance that comes from trying to make predictions from highly correlated data. DDPG utilizes a target network, just as the Deep Q-network did in Chapter 10, Deep Learning for Game Playing, only the targets ...
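To make the TD error concrete, here is a minimal sketch in plain NumPy. It is not the implementation used in this book: the critic is replaced by a linear stand-in, and the names `q_value`, `td_error`, and `critic_step`, along with the learning rate and discount factor, are assumptions chosen for illustration. In DDPG the next action would come from the target actor; here it is simply passed in.

```python
import numpy as np

def q_value(weights, state, action):
    """Linear stand-in for the critic: Q(s, a) = w . [s, a]."""
    return float(weights @ np.concatenate([state, action]))

def td_error(weights, target_weights, transition, next_action, gamma=0.99):
    """TD error: (r + gamma * Q_target(s', a')) - Q(s, a)."""
    state, action, reward, next_state, done = transition
    # The bootstrapped target is computed with the target network's weights.
    target = reward + gamma * (1.0 - done) * q_value(
        target_weights, next_state, next_action)
    return target - q_value(weights, state, action)

def critic_step(weights, target_weights, transition, next_action,
                lr=0.01, gamma=0.99):
    """One gradient step on 0.5 * td_error**2, holding the target fixed."""
    state, action = transition[0], transition[1]
    delta = td_error(weights, target_weights, transition, next_action, gamma)
    features = np.concatenate([state, action])
    # For a linear critic, dQ/dw is just the feature vector [s, a].
    return weights + lr * delta * features, delta

# Hypothetical transition: (state, action, reward, next_state, done)
transition = (np.array([1.0, 0.0]), np.array([0.5]), 1.0,
              np.array([0.0, 1.0]), 0.0)
next_action = np.array([0.1])  # would come from the target actor in DDPG

weights = np.zeros(3)
target_weights = np.zeros(3)
new_weights, delta = critic_step(weights, target_weights,
                                 transition, next_action)
new_delta = td_error(new_weights, target_weights, transition, next_action)
```

After one step the critic's estimate for this transition moves toward the target, so the magnitude of the TD error shrinks. A real critic would be a neural network trained on minibatches, but the target and error have the same shape.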