Asynchronous n-step Q-learning

The architecture of asynchronous n-step Q-learning is broadly similar to that of asynchronous one-step Q-learning. The difference is that the learning agent's actions are selected using the exploration policy for up to n steps, or until a terminal state is reached, in order to compute a single update of the policy network parameters. This process collects up to n rewards from the environment since the last update. Then, for each time step, the loss is calculated as the squared difference between the discounted future rewards (the n-step return) at that time step and the estimated Q-value.
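The n-step return computation described above can be sketched as follows. This is a minimal illustration, not the book's implementation: the function name `n_step_targets` and the sample values are hypothetical, and the bootstrap value stands in for `max_a Q(s_{t+n}, a)` from the target network (zero at a terminal state).

```python
import numpy as np

def n_step_targets(rewards, bootstrap_value, gamma=0.99):
    """Compute discounted n-step return targets by working backward
    from the bootstrap value (0 if the last state was terminal,
    otherwise the estimated max Q-value of the state reached)."""
    R = bootstrap_value
    targets = np.empty(len(rewards))
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R  # accumulate discounted return
        targets[t] = R
    return targets

# Hypothetical rollout: n = 3 steps, non-terminal, bootstrap Q = 2.0
targets = n_step_targets([1.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.9)
# The per-step loss is then (target - Q(s_t, a_t))**2 for each t.
```

Working backward means each earlier time step reuses the already-discounted return of the later steps, so the whole batch of targets is computed in a single O(n) pass.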