In value iteration, we start off with a random value function. Since a random value function is unlikely to be optimal, we iteratively compute new, improved value functions until we reach the optimal value function. Once we have the optimal value function, we can easily derive an optimal policy from it.
The steps involved in value iteration are as follows (the update rule and a code sketch follow the list):
- First, we initialize a random value function, that is, a random value for each state.
- Then we compute the Q function, Q(s, a), for all state-action pairs.
- Then we update our value function with the maximum value of Q(s, a) over the actions in each state.
- We repeat these steps until the change in the value function becomes negligible, that is, until it converges to the optimal value function.
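
In symbols, the second and third steps combine into the Bellman optimality update. The notation below is the standard one (an assumption, since the text only names Q(s, a)): P(s' | s, a) is the transition probability, R(s, a, s') the reward, and γ the discount factor:

$$
Q_k(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma V_k(s')\bigr], \qquad V_{k+1}(s) = \max_a Q_k(s, a)
$$

Once the values stop changing, $V_k$ is the optimal value function, and the optimal policy is read off greedily: $\pi^*(s) = \arg\max_a Q(s, a)$.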
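
To make the loop concrete, here is a minimal Python sketch of value iteration. Everything about the environment is an illustrative assumption: the MDP is given as nested dicts where P[s][a] is a list of (probability, next_state, reward) tuples, and the two-state example, GAMMA, and THRESHOLD are made up for the demo:

```python
GAMMA = 0.9        # discount factor (assumed)
THRESHOLD = 1e-6   # convergence threshold (assumed)

# Hypothetical two-state MDP: P[s][a] -> list of (prob, next_state, reward).
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 1.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
}

def q_value(P, V, s, a, gamma):
    # Expected return of taking action a in state s, then following V.
    return sum(prob * (reward + gamma * V[s_next])
               for prob, s_next, reward in P[s][a])

def value_iteration(P, gamma=GAMMA, threshold=THRESHOLD):
    # Step 1: initialize the value function arbitrarily (zeros here).
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Step 2: compute Q(s, a) for every action in state s.
            q_values = [q_value(P, V, s, a, gamma) for a in P[s]]
            # Step 3: replace V(s) with the maximum Q value.
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        # Step 4: stop once the value function has (almost) stopped changing.
        if delta < threshold:
            break
    # Derive the optimal policy greedily from the optimal value function.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
              for s in P}
    return V, policy

V, policy = value_iteration(P)
print(V)       # optimal state values
print(policy)  # greedy policy derived from V
```

Note that V is updated in place here (a Gauss-Seidel-style sweep); keeping a copy of the previous V and updating from it matches the equations above more literally, and both variants converge.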