October 2018
To illustrate the Q-Learning algorithm, consider a simple deterministic environment, as shown in the following figure. The environment has six states, and the figure shows the reward for each allowed transition. The reward is non-zero in only two cases: a transition into the Goal (G) state earns a reward of +100, while a move into the Hole (H) state earns -100. Both are terminal states; reaching either one ends an episode that began at the Start state:

Figure 9.3.1: Rewards in a simple deterministic world
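The reward structure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the book's implementation. The text does not spell out the layout in the figure, so the 2 x 3 grid, the positions of the Start, Goal, and Hole states, the action names, and the rule that an off-grid move leaves the agent in place are all assumptions made for this example. States are written as (row, column) tuples.

```python
# Assumed layout (hypothetical, not confirmed by the text): 2 x 3 grid, six states.
ROWS, COLS = 2, 3
START = (0, 0)   # assumed position of the Start state
GOAL = (0, 2)    # assumed position of the Goal (G) state
HOLE = (1, 2)    # assumed position of the Hole (H) state
TERMINAL = {GOAL, HOLE}

# Hypothetical action names mapped to (row, column) offsets.
ACTIONS = {"left": (0, -1), "down": (1, 0), "right": (0, 1), "up": (-1, 0)}


def step(state, action):
    """Deterministic transition: return (next_state, reward, done)."""
    if state in TERMINAL:
        raise ValueError("the episode has already ended")
    d_row, d_col = ACTIONS[action]
    row, col = state[0] + d_row, state[1] + d_col
    # Assumption: a move that would leave the grid keeps the agent in place.
    if not (0 <= row < ROWS and 0 <= col < COLS):
        row, col = state
    next_state = (row, col)
    # Only transitions into G or H carry a non-zero reward.
    if next_state == GOAL:
        reward = 100
    elif next_state == HOLE:
        reward = -100
    else:
        reward = 0
    return next_state, reward, next_state in TERMINAL


if __name__ == "__main__":
    print(step(START, "right"))    # ordinary move, zero reward
    print(step((0, 1), "right"))   # enters the assumed Goal cell
    print(step((1, 1), "right"))   # enters the assumed Hole cell
```

Because the environment is deterministic, the same state and action always produce the same next state and reward, which is what lets a single table of Q-values summarize it.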
To formalize the identity of each state, we need to use a (row, column) identifier as shown in the following figure. Since the agent has not ...