Policy evaluation and policy improvement

To find the optimal policy, you first need to find the optimal value function. An iterative procedure that does this is called policy evaluation—it creates a sequence that iteratively improves the value function for a policy, , using the state value transition of the model, the expectation of the next state, and the immediate reward. Therefore, it creates a sequence of improving value functions using the Bellman equation:

This sequence will converge to the optimal value as . Figure 3.3 shows the update ...

Get Reinforcement Learning Algorithms with Python now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.