To test this algorithm, we first need to initialize a value matrix with all entries equal to 0 (the values could also be chosen randomly but, as we don't have any prior information about the final configuration, every initial choice is probabilistically equivalent):
import numpy as np

tunnel_values = np.zeros(shape=(height, width))
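As a side illustration of why the choice of initial values is not critical, the following minimal sketch runs a value iteration on a small toy grid, once starting from zeros and once from random values, and checks that both runs reach the same value matrix. Everything here is an assumption made for the sake of the example and is not the tunnel environment of this section: the grid size, the rewards, the discount factor gamma, the rule that a move off the grid leaves the agent in place, and the helper name value_iteration() are all hypothetical.

```python
import numpy as np

# Hypothetical toy grid (NOT the tunnel environment used in the text)
height, width = 3, 4
gamma = 0.9                            # assumed discount factor
goal = (height - 1, width - 1)         # assumed terminal cell
step_reward, goal_reward = -0.1, 5.0   # assumed rewards
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def value_iteration(initial_values, tolerance=1e-8, max_iterations=10_000):
    """Iterate the Bellman optimality update until the values stop changing."""
    values = initial_values.copy()
    values[goal] = 0.0  # the terminal state has value 0 by convention

    for _ in range(max_iterations):
        old_values = values.copy()

        for i in range(height):
            for j in range(width):
                if (i, j) == goal:
                    continue

                candidates = []
                for di, dj in moves:
                    # A move that would leave the grid keeps the agent in place
                    ni = min(max(i + di, 0), height - 1)
                    nj = min(max(j + dj, 0), width - 1)
                    reward = goal_reward if (ni, nj) == goal else step_reward
                    candidates.append(reward + gamma * old_values[ni, nj])

                # Greedy backup over the deterministic transitions
                values[i, j] = max(candidates)

        if np.max(np.abs(values - old_values)) < tolerance:
            break

    return values


rng = np.random.default_rng(1000)
values_from_zeros = value_iteration(np.zeros(shape=(height, width)))
values_from_random = value_iteration(rng.uniform(-1.0, 1.0, size=(height, width)))

# Both starting points should lead to (numerically) the same values
print(np.allclose(values_from_zeros, values_from_random, atol=1e-6))
```

With a discount factor below 1, the update is a contraction, so the starting matrix only affects how many sweeps are needed, not the values that are eventually reached.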
At this point, we can define the two functions to perform the value evaluation and the final policy selection (the function is_final() is the one defined in the previous example):
import numpy as np

def value_evaluation():
    old_tunnel_values = tunnel_values.copy()

    for i in range(height):
        for j in range(width):
            rewards = np.zeros(shape=(nb_actions, ))
            old_values = np.zeros(shape=(nb_actions, ...