MDPs proceed in the following fashion: at each time step, t, the agent observes the environment's state, S_t ∈ S, and selects an action, A_t ∈ A, where S and A are the sets of states and actions, respectively. At the next time step, t+1, the agent receives a reward, R_{t+1} ∈ R, and transitions to the state S_{t+1}. Over time, the MDP gives rise to a trajectory, S_0, A_0, R_1, S_1, A_1, R_2, ..., that continues until the agent reaches a terminal state and the episode ends.
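The loop below is a minimal sketch of this interaction, not the book's own code. It assumes a hypothetical Gym-style environment whose reset() returns the initial state and whose step(action) returns (next_state, reward, done); the names env, policy, and generate_trajectory are all illustrative.

```python
def generate_trajectory(env, policy):
    """Roll out one episode: S_0, A_0, R_1, S_1, A_1, R_2, ..., until terminal."""
    trajectory = []
    state = env.reset()        # observe the initial state S_0
    done = False
    while not done:
        action = policy(state)                        # select A_t given S_t
        next_state, reward, done = env.step(action)   # receive R_{t+1}, move to S_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory
```

Each (state, action, reward) triple appended inside the loop corresponds to one step of the trajectory described above, with the reward indexed one step after the state-action pair that produced it.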
Finite MDPs, which have finite sets of states, S, actions, A, and rewards, R, admit well-defined discrete probability distributions over these elements. Due to the Markov property, these distributions depend only on the previous state and action. ...
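As a concrete illustration (not taken from the text), the dynamics of a finite MDP can be written out as an explicit table p(s', r | s, a). The toy two-state MDP below is entirely hypothetical; it only shows that each conditional distribution is discrete and conditioned solely on the previous state and action.

```python
import random

# Dynamics table p(s', r | s, a): for each (state, action) pair, a list of
# (probability, next_state, reward) outcomes. States, actions, and rewards
# here are made up for illustration.
P = {
    ("s0", "a0"): [(0.8, "s0", 0.0), (0.2, "s1", 1.0)],
    ("s0", "a1"): [(1.0, "s1", 0.0)],
    ("s1", "a0"): [(0.5, "s0", 2.0), (0.5, "s1", 0.0)],
    ("s1", "a1"): [(1.0, "terminal", 5.0)],
}

# Sanity check: each conditional distribution sums to 1 over all (s', r) pairs.
assert all(abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
           for outcomes in P.values())

def step(state, action):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    outcomes = P[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward
```

Note that the sampling in step depends only on the current (state, action) key, never on earlier history, which is exactly the Markov property the text describes.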