At this point, we can test the TD(0) algorithm on the checkerboard environment. The first step is to define an initial random policy and a value matrix with all elements equal to 0:
import numpy as np

policy = np.random.randint(0, nb_actions, size=(height, width)).astype(np.uint8)
tunnel_values = np.zeros(shape=(height, width))
Because each episode begins from a randomly selected cell, we need a helper function that excludes the terminal states when choosing the starting point (all the constants are the same as previously defined):
import numpy as np

xy_grid = np.meshgrid(np.arange(0, height), np.arange(0, width), sparse=False)
xy_grid = np.array(xy_grid).T.reshape(-1, 2)

xy_final = list(zip(x_wells, y_wells)) ...
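As a rough illustration of the idea of excluding terminal cells from the candidate starting points, the following self-contained sketch builds the list of allowed cells and samples one of them. The grid size, the well coordinates, and the names `xy_start`, `rng`, and `starting_point` are assumptions made for this example only, not the definitions used elsewhere in the text. The sketch also assumes that the first coordinate of each pair indexes rows (0 to height - 1) and the second indexes columns (0 to width - 1), and it treats only the wells as terminal cells.

```python
import numpy as np

# Hypothetical constants, chosen only to make the sketch runnable
width = 15
height = 5
x_wells = [0, 1, 3]
y_wells = [2, 4, 1]

# Enumerate every (row, column) pair of the grid
xy_grid = np.meshgrid(np.arange(0, height), np.arange(0, width), sparse=False)
xy_grid = np.array(xy_grid).T.reshape(-1, 2)

# Terminal cells (here, only the wells) stored as a set for fast lookup
xy_final = set(zip(x_wells, y_wells))

# Keep only the non-terminal cells as candidate starting points
xy_start = [(int(i), int(j)) for i, j in xy_grid
            if (int(i), int(j)) not in xy_final]

# Seeded generator so that repeated runs draw the same sequence
rng = np.random.default_rng(1000)


def starting_point():
    """Return a random non-terminal (row, column) cell."""
    idx = rng.integers(0, len(xy_start))
    return xy_start[idx]
```

Calling `starting_point()` at the beginning of each episode then yields a cell that is never one of the excluded terminal positions. Filtering the grid once and sampling from the resulting list avoids a rejection loop that would redraw whenever a terminal cell is hit.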