Return

When running a policy in an MDP, the sequence of states and actions (S0, A0, S1, A1, ...) is called a trajectory or rollout and is denoted by τ. Along each trajectory, a sequence of rewards is collected as a result of the actions. A function of these rewards is called the return, and in its most simplified version it is defined as the sum of the rewards:

G(τ) = r0 + r1 + r2 + ... + rn
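As a minimal sketch of this definition, the undiscounted return of a finite rollout is just the sum of the rewards collected along it (the function name and sample rewards below are illustrative, not from the book):

```python
def compute_return(rewards):
    """Undiscounted return: the sum of rewards along a finite trajectory."""
    return sum(rewards)

# Hypothetical rewards collected after each action of a rollout
rewards = [1.0, 0.0, -1.0, 2.0]
print(compute_return(rewards))  # 2.0
```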

At this point, the return can be analyzed separately for trajectories with infinite and finite horizons. This distinction is needed because, in the case of interactions with an environment that do not terminate, ...
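The excerpt is cut off here, but the standard resolution in reinforcement learning is to introduce a discount factor γ in [0, 1), which keeps the return of a non-terminating trajectory bounded. A sketch of the discounted return under that assumption (the γ value below is illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return: r0 + gamma*r1 + gamma^2*r2 + ...

    Accumulating backward avoids computing powers of gamma explicitly.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

Because each term is weighted by gamma**t, the infinite sum converges whenever the rewards are bounded, which is why discounting matters for the infinite-horizon case.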
