Published: December 2018
Level: Beginner to intermediate
Length: 684 pages
Estimated time: 21h 9m
Language: English
The dynamic and interactive nature of RL means that the agent estimates the value of states and actions before it has experienced all relevant trajectories. It can therefore make decisions, but those decisions rest on incomplete learning. Decisions that only exploit past (successful) experience, rather than exploring uncharted territory, limit the agent's exposure and can prevent it from learning an optimal policy. An RL algorithm needs to balance this exploration-exploitation trade-off: too little exploration will likely produce biased value estimates and suboptimal policies, whereas too little exploitation keeps the agent from capitalizing on what it has already learned and degrades the rewards it collects along the way.
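To make the trade-off concrete, the sketch below contrasts a purely greedy agent (no exploration) with an epsilon-greedy one on a simple multi-armed bandit. This is a minimal illustration under assumed conditions, not code from the book; the bandit setup, the `epsilon_greedy_bandit` function, and its parameters are hypothetical.

```python
import numpy as np

def epsilon_greedy_bandit(true_means, n_steps=1000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy agent on a stationary Gaussian bandit.

    With probability `epsilon` the agent explores (picks a random arm);
    otherwise it exploits the arm with the highest estimated value.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    value_estimates = np.zeros(n_arms)  # incremental sample-average estimates
    pull_counts = np.zeros(n_arms)
    total_reward = 0.0

    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))        # explore: random arm
        else:
            arm = int(np.argmax(value_estimates))  # exploit: best estimate so far
        reward = rng.normal(true_means[arm], 1.0)  # noisy reward from the chosen arm
        pull_counts[arm] += 1
        # incremental update of the sample-average value estimate
        value_estimates[arm] += (reward - value_estimates[arm]) / pull_counts[arm]
        total_reward += reward

    return value_estimates, total_reward

# Compare no exploration (epsilon=0) with a balanced setting (epsilon=0.1).
true_means = [0.2, 0.5, 1.0]  # hypothetical arm payoffs; arm 2 is optimal
for eps in (0.0, 0.1):
    estimates, reward = epsilon_greedy_bandit(true_means, epsilon=eps)
    print(f"epsilon={eps}: estimates={np.round(estimates, 2)}, total reward={reward:.1f}")
```

With epsilon=0 the agent typically locks onto the first arm that looks profitable and its value estimates for the other arms stay biased at zero, whereas a small amount of exploration lets it discover the better arm while still exploiting most of the time.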