The State-action-reward-state-action (SARSA) algorithm implements an on-policy temporal difference (TD) method, in which the update of the action-value function is based on the outcome of the transition from state s to state s' through action a, together with the next action a' selected in s' by the same policy π(s, a).
There are deterministic (greedy) policies, which always choose the action with the highest estimated value, and non-deterministic policies (ε-greedy, ε-soft, softmax), which preserve an element of exploration during the learning phase.
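As a rough illustration of the second kind of policy, the sketch below implements ε-greedy action selection over a tabular q(s, a); the function name, table shape, and ε value are illustrative assumptions, not taken from the book.

import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon explore (random action), otherwise exploit
    the action with the highest estimated value in the current state."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit

# toy usage: a hypothetical 5-state, 2-action table, all values zero at the start
rng = np.random.default_rng(0)
Q = np.zeros((5, 2))
action = epsilon_greedy(Q, state=0, epsilon=0.1, rng=rng)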
In SARSA, it is necessary to estimate the action-value function q(s, a), because the value of a state v(s) (value function) alone does not indicate which action to take in that state without a model of the environment.
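For the update step itself, a minimal sketch of the on-policy TD update on a tabular q(s, a) might look like the following; the learning rate, discount factor, and table dimensions are assumed values for illustration only.

import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA step: the TD target uses the action a' actually selected
    by the current policy in s', which is what makes the method on-policy."""
    td_target = r + gamma * Q[s_next, a_next]
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q

# toy usage with a hypothetical 5-state, 2-action table
Q = np.zeros((5, 2))
Q = sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)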