We can now test the SARSA algorithm in the original tunnel environment (all of the elements that are not redefined here are the same as in the previous chapter). The first step is defining the Q(s, a) array and the constants employed in the training process:
import numpy as np

nb_actions = 4

# Q(s, a) array: one value per grid cell and per action
Q = np.zeros(shape=(height, width, nb_actions))

# Starting point of every episode
x_start = 0
y_start = 0

# Maximum number of steps per episode and learning rate
max_steps = 2000
alpha = 0.25
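For readers without the previous chapter at hand, the inherited elements are the grid size (height, width), the coordinates of the wells (x_wells, y_wells), and the position of the positive final state (x_final, y_final). A minimal sketch of how they might be defined follows; the dimensions and positions below are purely illustrative assumptions, not the actual values employed in the previous chapter:

# Illustrative assumption: grid size and special-state positions
# (the actual values are defined in the previous chapter)
height = 5
width = 15

# Position of the positive final state
x_final = height - 1
y_final = width - 1

# Positions of the wells (negative absorbing states)
x_wells = [0, 1, 3, 4]
y_wells = [7, 2, 10, 5]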
As we want to employ an ε-greedy policy, we can set the starting point to (0, 0), forcing the agent to reach the positive final state. We can now define the functions needed to perform a training step:
import numpy as np

def is_final(x, y):
    # A state is final if it is a well (negative absorbing state)
    # or the positive final state
    if (x, y) in zip(x_wells, y_wells) or (x, y) == (x_final, y_final):
        return True
    return False

...
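The remaining functions are omitted above. As a sketch of what a single training step could contain, the following defines an ε-greedy action selection and the SARSA update rule; the names select_action and sarsa_step, the epsilon parameter, the discount factor gamma, and the reward argument are assumptions introduced here for illustration, not the book's exact code:

def select_action(epsilon, x, y):
    # Hypothetical sketch: with probability epsilon, explore with a
    # random action; otherwise exploit the current greedy action
    if np.random.uniform(0.0, 1.0) < epsilon:
        return np.random.randint(0, nb_actions)
    return np.argmax(Q[x, y])

def sarsa_step(x, y, action, x_next, y_next, action_next, reward, gamma=0.9):
    # On-policy SARSA update:
    # Q(s, a) += alpha * (r + gamma * Q(s', a') - Q(s, a))
    # gamma = 0.9 is an assumed discount factor
    Q[x, y, action] += alpha * \
        (reward + gamma * Q[x_next, y_next, action_next] -
         Q[x, y, action])

Note that, unlike Q-learning, the update employs the action actually selected in the next state (a'), which is what makes SARSA an on-policy algorithm.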