Reward
At each timestep, that is, after each move of the agent, the environment sends back a number that indicates how good that action was for the agent. This number is called the reward. As already mentioned, the end goal of the agent is to maximize the cumulative reward obtained during its interaction with the environment.
In the literature, the reward is assumed to be part of the environment, but that is not strictly true in practice. The reward can also come from the agent, but never from its decision-making part. For this reason, and to simplify the formulation, the reward is always treated as being sent by the environment.
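As a rough illustration of the two paragraphs above, the following sketch shows an agent-environment loop in which the environment computes the reward at every timestep and the cumulative reward is summed over one episode. Everything here is an assumption made for the example: the `CorridorEnv` class, its reward values (1.0 at the goal, -0.1 otherwise), and the `run_episode` helper are hypothetical and do not come from any particular library.

```python
class CorridorEnv:
    """Hypothetical toy environment: the agent walks along a corridor to a goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Start every episode at the left end of the corridor.
        self.position = 0
        return self.position

    def step(self, action):
        # action is +1 (move right) or -1 (move left); position is clamped.
        self.position = max(0, min(self.length, self.position + action))
        done = self.position == self.length
        # The reward is computed by the environment, not by the agent's
        # decision-making part. The values are arbitrary illustrative choices.
        reward = 1.0 if done else -0.1
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the cumulative reward the agent collected."""
    state = env.reset()
    cumulative_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent decides
        state, reward, done = env.step(action)  # the environment responds
        cumulative_reward += reward
        if done:
            break
    return cumulative_reward


def always_right(state):
    # A trivial policy that ignores the state and always moves right.
    return 1


if __name__ == "__main__":
    total = run_episode(CorridorEnv(length=5), always_right)
    print(f"cumulative reward: {total:.2f}")
```

In this sketch, a policy that heads straight for the goal collects a higher cumulative reward than one that wanders, because every extra step costs a small negative reward. That is the sense in which the reward signals how good an action was.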
The reward is the only supervision signal injected into the RL cycle and it is essential to design the reward in the correct ...