October 2019
Intermediate to advanced
366 pages
12h 4m
English
When running a policy in an MDP, the sequence of states and actions (S0, A0, S1, A1, ...) is called a trajectory or rollout, and is denoted by $\tau$. In each trajectory, a sequence of rewards is collected as a result of the actions. A function of these rewards is called the return and, in its most simplified version, it is defined as follows:

$$G(\tau) = r_0 + r_1 + r_2 + \dots = \sum_{t=0}^{\infty} r_t$$
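
As a minimal sketch, assuming the simplest return is the plain, undiscounted sum of the rewards collected along a trajectory, the computation could look like the following. The `Step` layout, the `simple_return` name, and the sample rollout are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# One step of a trajectory stored as (state, action, reward).
# This layout is an assumption made for illustration only.
Step = Tuple[int, int, float]


def simple_return(trajectory: List[Step]) -> float:
    """Undiscounted return: the sum of all rewards in the trajectory."""
    return sum(reward for _, _, reward in trajectory)


# Hypothetical finite rollout: (S0, A0, r0), (S1, A1, r1), (S2, A2, r2)
rollout = [(0, 1, 1.0), (2, 0, 0.0), (3, 1, 2.0)]
print(simple_return(rollout))
```

This plain sum is only well defined when the trajectory is finite; a rollout that never terminates would need a different treatment.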

At this point, the return can be analyzed separately for trajectories with infinite and finite horizons. This distinction is needed because in the case of interactions within an environment that do not terminate, ...