December 2018 · Beginner to intermediate · 684 pages · 21h 9m · English
First, we create the value iteration algorithm, which is slightly simpler because it implements policy evaluation and policy improvement in a single step. We capture the states for which we need to update the value function, excluding the terminal states, which have a value of 0 because no further rewards follow them (the +1/-1 rewards are earned on the transitions that lead into them), and skipping the blocked cell:
skip_states = list(absorbing_states.keys()) + [blocked_state]
states_to_update = [s for s in states if s not in skip_states]
Then, we initialize the value function and set the discount factor gamma and the convergence threshold epsilon:
V = np.random.rand(num_states)
V[skip_states] = 0
gamma = .99
epsilon = 1e-5
The algorithm updates the value function using the Bellman optimality ...
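Putting the setup above together, the update loop can be sketched as follows. The grid layout, transition function, and reward placement below are illustrative assumptions (a 4x3 gridworld with deterministic moves, a +1 and a -1 terminal cell, and one blocked cell), not necessarily the book's exact environment:

```python
import numpy as np

# Hypothetical 4x3 gridworld: 12 states, two terminals, one blocked cell.
num_states = 12
absorbing_states = {3: 1.0, 7: -1.0}  # terminal cells and their rewards
blocked_state = 5

states = list(range(num_states))
skip_states = list(absorbing_states.keys()) + [blocked_state]
states_to_update = [s for s in states if s not in skip_states]

def next_state(s, a):
    """Deterministic move on the 4-column grid; stay put at walls/block."""
    row, col = divmod(s, 4)
    dr, dc = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}[a]
    nr, nc = row + dr, col + dc
    if not (0 <= nr < 3 and 0 <= nc < 4) or nr * 4 + nc == blocked_state:
        return s
    return nr * 4 + nc

def reward(s_next):
    """Reward is earned on the transition into a terminal state."""
    return absorbing_states.get(s_next, 0.0)

V = np.random.rand(num_states)
V[skip_states] = 0  # terminals and the blocked cell keep a value of 0
gamma = .99
epsilon = 1e-5

while True:
    delta = 0.0
    for s in states_to_update:
        # Bellman optimality backup: best action value from state s.
        v_new = max(reward(next_state(s, a)) + gamma * V[next_state(s, a)]
                    for a in 'UDLR')
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < epsilon:  # stop once the largest update falls below epsilon
        break
```

Because the terminal values are pinned at 0 and never updated, the term `gamma * V[next_state]` vanishes on transitions into terminals, so the backup correctly reduces to the +1/-1 reward there.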