Policy evaluation example

To better understand policy evaluation, let's work through an example. Imagine a simple robot navigating a grid environment (this example is also known as gridworld). We'll assume the following:

  • The grid is of size 4 x 4. It's very similar to the maze example we defined earlier, except that it has no walls. The cells are numbered from 1 to 16, where cells 1 and 16 are terminal states.
  • The robot can navigate up, down, left, or right to any of the neighboring states. Actions that take the robot off the grid leave it in its current state (but the reward is still received). 
  • The environment is deterministic; that is, the transition probability of moving to the corresponding neighbor state when taking an action is always 1. ...
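
To make the assumptions listed so far concrete, here is a minimal sketch of the gridworld's transition rule in Python. It is an illustration only, not the book's implementation: the names `GRID_SIZE`, `TERMINAL_STATES`, `ACTIONS`, `next_state`, and `is_terminal` are hypothetical, and we assume the cells are numbered in row-major order (left to right, top to bottom), which the text does not state. Rewards are deliberately left out, because they are not specified in the assumptions above.

```python
# Sketch of the gridworld transitions described above (illustrative names only).
# Assumption: cells are numbered 1..16 in row-major order
# (left to right, then top to bottom).
GRID_SIZE = 4
TERMINAL_STATES = {1, 16}

# (row delta, column delta) for each of the four actions
ACTIONS = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def next_state(state: int, action: str) -> int:
    """Return the cell reached by taking `action` in `state`.

    The environment is deterministic, so there is exactly one next state.
    """
    row, col = divmod(state - 1, GRID_SIZE)
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    # A move that would leave the grid keeps the robot in its current cell.
    if not (0 <= new_row < GRID_SIZE and 0 <= new_col < GRID_SIZE):
        return state
    return new_row * GRID_SIZE + new_col + 1


def is_terminal(state: int) -> bool:
    """Cells 1 and 16 are the terminal states."""
    return state in TERMINAL_STATES


if __name__ == "__main__":
    print(next_state(6, "right"))  # interior move to the right-hand neighbor
    print(next_state(4, "up"))     # off-grid move, the robot stays put
```

Under the row-major assumption, moving right from cell 6 leads to cell 7, while moving up from cell 4 (top row) leaves the robot in cell 4.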
