To find the optimal policy, you first need to find the optimal value function. An iterative procedure that helps do this is called policy evaluation: it creates a sequence of estimates that iteratively improves the value function for a policy, π, using the model's state-transition dynamics, the expected value of the next state, and the immediate reward. That is, it creates a sequence of improving value functions by repeatedly applying the Bellman equation as an update:

v_{k+1}(s) = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, v_k(S_{t+1}) \mid S_t = s \right] = \sum_a \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\, \big[ r + \gamma\, v_k(s') \big]
This sequence will converge to the policy's value function, v_\pi, as k \to \infty. Figure 3.3 shows the update ...
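The update above can be sketched in code. The following is a minimal illustration, not a definitive implementation: it assumes a hypothetical two-state, two-action MDP whose dynamics `P[s][a]` are given as lists of `(probability, next_state, reward)` tuples, and repeatedly applies the Bellman expectation backup until the value function stops changing.

```python
# Hypothetical 2-state, 2-action MDP (all names and numbers are illustrative).
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
# pi(a|s): a uniformly random policy over the two actions.
policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
gamma = 0.9  # discount factor

def policy_evaluation(P, policy, gamma, theta=1e-10):
    """Iteratively apply the Bellman expectation update
    v_{k+1}(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v_k(s')]
    until the largest change falls below theta."""
    v = {s: 0.0 for s in P}  # v_0: arbitrary initialization (zeros)
    while True:
        delta = 0.0
        for s in P:
            # Expected immediate reward plus discounted next-state value,
            # averaged over the policy's action probabilities.
            new_v = sum(
                pi_a * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a, pi_a in policy[s].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

v = policy_evaluation(P, policy, gamma)
print(v)
```

Each sweep over the states produces the next value function v_{k+1} in the sequence; stopping when the largest per-state change is below a small threshold approximates the limit k → ∞.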