Policy iteration

Policy iteration is a dynamic programming algorithm that uses a value function to model the expected return for each state-action pair. Many techniques in reinforcement learning build on this idea, including Q-learning, TD-learning, SARSA, QV-learning, and more. These techniques update the value function using the immediate reward plus the (discounted) value of the next state, a process called bootstrapping. Therefore, they require storing Q(s, a) either in tables or via function approximation techniques.
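As a minimal sketch of the bootstrapping idea described above, the update below mixes the immediate reward with the discounted value of the next state, with Q(s, a) stored in a plain table (a dictionary here). The function name, step size `alpha`, and discount `gamma` are illustrative assumptions, not from the text:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One bootstrapped update of a tabular Q(s, a) estimate.

    Target = immediate reward + discounted value of the best
    action available in the next state (illustrative sketch).
    """
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

# Usage: one transition (state 0, action 'a', reward 1.0, next state 1)
Q = {}
q_update(Q, 0, 'a', 1.0, 1, actions=['a', 'b'])
```

With an empty table, the target is just the reward, so the entry moves a fraction `alpha` of the way toward it.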

Policy iteration is generally applied to discrete Markov decision processes, where both the state space S and the action space A are discrete and finite sets.
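For such a discrete, finite MDP, policy iteration alternates between evaluating the current policy and greedily improving it until the policy stops changing. The toy two-state, two-action MDP below (transition table `P`, reward table `R`, discount `gamma`) is an invented example for illustration only:

```python
# Tiny deterministic MDP, assumed for illustration:
# P[s][a] = next state, R[s][a] = reward, both finite tables.
P = [[0, 1], [0, 1]]
R = [[0.0, 1.0], [0.0, 2.0]]
gamma = 0.9
n_states, n_actions = 2, 2

def policy_evaluation(policy, tol=1e-8):
    """Sweep the Bellman equation for a fixed policy until values settle."""
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            a = policy[s]
            v = R[s][a] + gamma * V[P[s][a]]
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def policy_improvement(V):
    """Act greedily with respect to the current value estimates."""
    return [max(range(n_actions), key=lambda a: R[s][a] + gamma * V[P[s][a]])
            for s in range(n_states)]

def policy_iteration():
    policy = [0] * n_states  # arbitrary initial policy
    while True:
        V = policy_evaluation(policy)
        new_policy = policy_improvement(V)
        if new_policy == policy:  # policy stable: it is optimal
            return policy, V
        policy = new_policy

policy, V = policy_iteration()
```

On this toy MDP the loop converges to the policy that always picks action 1, whose values satisfy V(1) = 2 + 0.9·V(1) = 20 and V(0) = 1 + 0.9·20 = 19.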

Starting from an initial policy P0, the iteration of ...
