DDPG uses two key ideas, both borrowed from DQN but adapted for the actor-critic case:
- Replay buffer: All the transitions collected during the agent's lifetime are stored in a replay buffer, also called experience replay. The actor and the critic are then trained on mini-batches sampled from this buffer.
- Target network: Q-learning is unstable, because the network being updated is also the one used to compute the target values. If you remember, DQN mitigates this problem by employing a target network that is updated every N iterations (copying the parameters of the online network into the target network). The DDPG paper shows that a soft target update works better in this context. With a soft update, the target network parameters are slowly blended with the online parameters at every step: θ' ← τθ + (1 − τ)θ', with τ ≪ 1 (for example, τ = 0.001). Both ideas are sketched in code after this list.
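As a quick illustration, here is a minimal sketch of both ideas, assuming PyTorch for the networks; the names `ReplayBuffer` and `soft_update`, and the default values for `capacity`, `batch_size`, and `tau`, are illustrative choices rather than a definitive implementation:

```python
import random
from collections import deque

import torch.nn as nn


class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples random mini-batches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def store(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        obs, actions, rewards, next_obs, dones = map(list, zip(*batch))
        return obs, actions, rewards, next_obs, dones

    def __len__(self):
        return len(self.buffer)


def soft_update(target_net: nn.Module, online_net: nn.Module, tau: float = 0.001):
    """Move each target parameter a small step toward the online parameter:
    theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for target_param, param in zip(target_net.parameters(), online_net.parameters()):
        target_param.data.copy_(tau * param.data + (1.0 - tau) * target_param.data)
```

In a training loop, the agent would call `store` after every environment step, `sample` to build each mini-batch for the actor and critic updates, and `soft_update` once per update for both the target actor and the target critic, so the targets change slowly and keep the bootstrapped values stable.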