When running a policy in an MDP, the sequence of states and actions $(S_0, A_0, S_1, A_1, \dots)$ is called a trajectory or rollout, and is denoted by $\tau$. In each trajectory, a sequence of rewards is collected as a result of the actions. A function of these rewards is called the return, and in its simplest form it is defined as follows:

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$$

where $T$ is the final time step of the trajectory.
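To make this concrete, here is a minimal sketch of collecting a single rollout and computing its return as a plain sum of rewards. The environment (CartPole-v1), the random policy, and the Gymnasium-style reset/step API are illustrative assumptions, not prescribed by the text:

```python
# A minimal sketch: collect one trajectory (rollout) under a random policy
# and compute its return as the undiscounted sum of rewards.
# CartPole-v1 and the Gymnasium API are illustrative choices, not from the text.
import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset(seed=0)

trajectory = []   # will hold (S_t, A_t, R_{t+1}) tuples
ret = 0.0         # the return: running sum of collected rewards

done = False
while not done:
    action = env.action_space.sample()  # a random policy, for illustration
    next_state, reward, terminated, truncated, _ = env.step(action)
    trajectory.append((state, action, reward))
    ret += reward                       # simplest return: plain sum of rewards
    state = next_state
    done = terminated or truncated

print(f"Trajectory length: {len(trajectory)}, return = {ret}")
```

Because CartPole episodes terminate, this loop always ends and the return is a finite sum; the discussion below turns to what happens when termination is not guaranteed.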
At this point, the return can be analyzed separately for trajectories with finite and infinite horizons. This distinction is needed because, for interactions with an environment that never terminate, the sum of rewards defined above can grow without bound, so the return would no longer be a finite, meaningful quantity.
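As a brief illustration of why the infinite-horizon case is problematic (the constant-reward setup and the discount factor $\gamma$ used here are standard RL devices, assumed for the sketch rather than defined in the text above): if the agent collects a constant reward $r > 0$ at every step of a never-ending trajectory, the plain sum diverges, whereas weighting the reward at step $t$ by $\gamma^t$ for some $\gamma \in (0, 1)$ keeps the return finite via the geometric series:

$$\sum_{t=0}^{\infty} r = \infty, \qquad \sum_{t=0}^{\infty} \gamma^t r = \frac{r}{1-\gamma} < \infty$$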