April 2018
Intermediate to advanced
334 pages
10h 18m
English
Off-policy learning, as the name suggests, learns the optimal policy independently of the policy the agent follows while acting. You therefore don't need a specific policy to start with: the agent can begin by taking random actions and, provided it keeps exploring enough, still converge to the optimal policy. Q-learning is an example of off-policy learning.
On-policy learning, on the other hand, learns by carrying out the current policy, exploratory actions included, and improving that same policy from the experience it generates. What it learns therefore depends on the policy being followed. The SARSA algorithm is an example of on-policy learning.
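The difference between the two is easiest to see in the update rules. The following is a minimal sketch, assuming a tabular setting where `q` is a list of per-state lists of action values; the function names, the `alpha` (learning rate) and `gamma` (discount factor) parameters, and the toy numbers are illustrative assumptions, not taken from the text. Q-learning bootstraps from the best action in the next state regardless of what the agent does next, whereas SARSA bootstraps from the action the current policy actually selects.

```python
import random


def epsilon_greedy(q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    n_actions = len(q[state])
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q[state][a])


def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """Off-policy: the target uses the greedy value of the next state,
    no matter which action the agent will actually take there."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: the target uses the value of a_next, the action the
    current (exploring) policy really chose in the next state."""
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


def demo():
    """Apply both updates to the same hypothetical transition."""
    alpha, gamma = 0.5, 0.9
    # Two states, two actions each; state 1 has a clearly better action 1.
    q_off = [[0.0, 0.0], [1.0, 5.0]]
    q_on = [[0.0, 0.0], [1.0, 5.0]]
    s, a, r, s_next = 0, 0, 1.0, 1
    a_next = 0  # assume exploration picked the worse action in state 1
    q_learning_update(q_off, s, a, r, s_next, alpha, gamma)
    sarsa_update(q_on, s, a, r, s_next, a_next, alpha, gamma)
    return q_off[s][a], q_on[s][a]


if __name__ == "__main__":
    rng = random.Random(0)
    print("sampled action:", epsilon_greedy([[0.0, 1.0]], 0, 0.1, rng))
    print("Q-learning vs SARSA estimate:", demo())
```

In this toy transition, the Q-learning estimate moves toward roughly 2.75 because it assumes the best follow-up action, while the SARSA estimate moves toward roughly 0.95 because it accounts for the exploratory action that was actually taken.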