Policy iteration starts from a randomly initialized policy. We then evaluate that policy: is it good or not? To answer this, we compute the value function of the policy. If the policy is not yet optimal, we use that value function to derive a new, improved policy. We repeat these two steps, evaluation and improvement, until the policy stops changing.
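To make the loop concrete, here is a minimal sketch of the evaluate-and-improve cycle on a tiny, made-up three-state problem rather than on Frozen Lake itself. The transition table `P`, the helper names `evaluate`, `improve`, and `policy_iteration`, and the discount factor of 0.9 are all assumptions chosen for illustration. The table uses the layout `P[state][action] = [(probability, next_state, reward, done), ...]`, and plain Python lists stand in for NumPy arrays so the sketch runs without dependencies.

```python
import random

# Hypothetical 3-state chain: action 1 moves right, action 0 moves left or stays.
# Reaching state 2 from state 1 pays a reward of 1; state 2 is terminal.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}


def evaluate(policy, P, gamma, tol=1e-10):
    """Compute the value of every state when following `policy`."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            v = sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V


def improve(V, P, gamma):
    """Pick, in every state, the action with the highest one-step lookahead value."""
    policy = []
    for s in range(len(P)):
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P[s][a]) for a in P[s]}
        policy.append(max(q, key=q.get))
    return policy


def policy_iteration(P, gamma=0.9, max_iters=1000, seed=0):
    rng = random.Random(seed)
    # Start from a random policy: one random action per state.
    policy = [rng.choice(list(P[s])) for s in range(len(P))]
    for _ in range(max_iters):
        V = evaluate(policy, P, gamma)
        new_policy = improve(V, P, gamma)
        if new_policy == policy:  # no change means the policy is stable
            return policy, V
        policy = new_policy
    return policy, evaluate(policy, P, gamma)


policy, V = policy_iteration(P)
print(policy, V)
```

In this toy problem the loop settles on moving right in states 0 and 1, whatever the random starting policy was.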
Now let us see how to solve the frozen lake problem using policy iteration.
Before looking at policy iteration, we will see how to compute a value function, given a policy.
We initialize value_table as an array of zeros, with one entry per state:
value_table = np.zeros(env.nS)
Then, for each state, we get the action from the policy, and we compute the value function according ...
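The evaluation step described above can be sketched as follows. This is an illustrative version under stated assumptions, not necessarily the exact code used for Frozen Lake: the function name `compute_value_function`, the `threshold` parameter, the discount factor, and the small transition table `toy_P` are all hypothetical. The table follows the layout `P[state][action] = [(probability, next_state, reward, done), ...]`, and a plain list replaces the NumPy array so the sketch has no dependencies.

```python
def compute_value_function(policy, P, n_states, gamma=1.0, threshold=1e-10):
    """Iteratively estimate the value of each state under a fixed policy."""
    value_table = [0.0] * n_states  # one entry per state, initialized to zero
    while True:
        previous = list(value_table)  # snapshot of the last sweep
        for state in range(n_states):
            action = policy[state]  # the action the policy prescribes here
            # Expected reward plus discounted value of the next state.
            value_table[state] = sum(
                prob * (reward + gamma * previous[next_state])
                for prob, next_state, reward, _done in P[state][action]
            )
        # Stop once a full sweep barely changes the table.
        change = sum(abs(old - new) for old, new in zip(previous, value_table))
        if change <= threshold:
            return value_table


# Hypothetical 3-state chain: action 1 moves right, and reaching state 2 pays 1.
toy_P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
values = compute_value_function([1, 1, 0], toy_P, n_states=3, gamma=0.9)
print(values)
```

Note that with `gamma=1.0` the sweep is only guaranteed to settle when the policy cannot collect reward forever; a discount below 1 avoids that concern.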