Now, as you may have noticed in the last example, Chapter_8_DDPG.py trains with four networks/models: two actors and two critics, where each pair consists of a current network and a target network. This gives us the following diagram:
Each oval in the preceding diagram represents a complete deep learning network. Notice how the critic, the value or Q network implementation, takes both environment outputs, reward and state. The critic then pushes a value back to the actor, or policy, target network.
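To make the four-network layout concrete, here is a minimal sketch of how the current and target copies of the actor and critic relate to one another. This is illustrative only: the use of PyTorch, the layer sizes, the network names, and the soft_update helper with its tau parameter are assumptions for the sketch, not the exact code in Chapter_8_DDPG.py.

```python
# Illustrative sketch of the four DDPG networks (assumed names and sizes).
import copy
import torch
import torch.nn as nn

def make_actor(state_dim, action_dim):
    # Policy network: maps a state to a continuous action in [-1, 1]
    return nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, action_dim), nn.Tanh())

def make_critic(state_dim, action_dim):
    # Value (Q) network: maps a (state, action) pair to a scalar value
    return nn.Sequential(
        nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
        nn.Linear(64, 1))

state_dim, action_dim = 3, 1  # example dimensions (assumed)

# The four networks: current and target copies of the actor and critic
actor = make_actor(state_dim, action_dim)
critic = make_critic(state_dim, action_dim)
actor_target = copy.deepcopy(actor)
critic_target = copy.deepcopy(critic)

def soft_update(target, source, tau=0.005):
    # Target networks slowly track the current ones:
    # theta_target <- tau * theta + (1 - tau) * theta_target
    for t_param, param in zip(target.parameters(), source.parameters()):
        t_param.data.copy_(tau * param.data + (1.0 - tau) * t_param.data)

# The critic evaluates the action chosen by the actor for a given state
state = torch.randn(1, state_dim)
action = actor(state)
q_value = critic(torch.cat([state, action], dim=1))
print(q_value.shape)  # torch.Size([1, 1])
```

The key point the diagram makes is visible here: the critic consumes the state together with the actor's action, while the target copies exist only to provide stable values during training and are updated slowly toward the current networks.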
Open the Chapter_8_DDPG.py example back up and follow the next exercise ...