Asynchronous one-step SARSA

The architecture of asynchronous one-step SARSA is almost identical to that of asynchronous one-step Q-learning, except for the way the target network computes the target state-action value of the current state. Instead of using the maximum Q-value of the next state s' under the target network, SARSA selects the action a' for the next state s' with an ε-greedy policy and uses the Q-value of that next state-action pair, Q(s', a'; θ⁻), to compute the target. That is, the target becomes r + γ Q(s', a'; θ⁻) rather than the Q-learning target r + γ max_a' Q(s', a'; θ⁻).

The pseudo-code ...
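
To make the difference concrete, here is a minimal single-machine sketch of asynchronous one-step SARSA (not the book's pseudo-code), using a tabular value function in place of the TensorFlow networks. The environment dynamics, hyperparameters, and helper names (step, epsilon_greedy, target_sync_every, and so on) are illustrative assumptions; the point is the ε-greedy SARSA target and the periodically synced target parameters shared across worker threads.

import threading
import numpy as np

# Hypothetical toy set-up standing in for the book's TensorFlow networks.
n_states, n_actions = 16, 4
gamma, epsilon, alpha = 0.99, 0.1, 0.1
target_sync_every = 100               # refresh theta^- every N global steps

q = np.zeros((n_states, n_actions))   # shared parameters theta
q_target = q.copy()                   # target parameters theta^-
global_step = 0

def epsilon_greedy(values):
    # With probability epsilon pick a random action, otherwise the greedy one.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(values))

def step(s, a):
    # Placeholder environment dynamics; a real worker would use its own
    # copy of the environment here.
    s_next = (s + a) % n_states
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r, s_next == n_states - 1

def worker():
    global global_step, q_target
    s = 0
    a = epsilon_greedy(q[s])          # SARSA: the behavior policy picks a
    while global_step < 10_000:
        s_next, r, done = step(s, a)
        # SARSA target: epsilon-greedy a' and Q(s', a'; theta^-),
        # instead of max_a' Q(s', a'; theta^-) as in Q-learning.
        a_next = epsilon_greedy(q[s_next])
        y = r if done else r + gamma * q_target[s_next, a_next]
        q[s, a] += alpha * (y - q[s, a])   # asynchronous (Hogwild-style) update
        global_step += 1
        if global_step % target_sync_every == 0:
            q_target = q.copy()       # periodically sync theta^- <- theta
        s, a = (0, epsilon_greedy(q[0])) if done else (s_next, a_next)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()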
