In Chapter 1, we presented planning methods in which the agent knows the transition and reward functions of the Markov decision problem it faces. In this chapter, we present reinforcement learning methods, where the transition and reward functions are not known in advance.
EXAMPLE 2.1. Let us consider again the car example used in the introduction of the previous chapter (see section 1.1). If we must look after a type of car we have never encountered before and do not have the corresponding manual, we cannot directly model the problem as an MDP: we would first have to determine the probability of each breakdown, the cost of each repair operation and so on. In such a case, reinforcement learning offers a way to find, by trial and error and through incremental experience, the best way to look after the car. It may do so without ever explicitly determining all the probabilities and costs, relying only on a locally updated value of each action in each situation.
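To make the idea of a "locally updated value of each action in each situation" more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not one of the algorithms presented later in this chapter: the situation and action names, the numeric outcomes and the step size are all hypothetical, and each outcome is treated as a single number (a negative cost) observed after trying an action.

```python
from collections import defaultdict


def update_value(values, situation, action, outcome, step_size=0.1):
    """Move the stored estimate a fraction of the way toward the observed outcome."""
    key = (situation, action)
    values[key] += step_size * (outcome - values[key])
    return values[key]


# One value per (situation, action) pair, all starting at 0.
values = defaultdict(float)

# Hypothetical experience: (situation, action, observed outcome).
# Outcomes are negative because they stand for costs.
experience = [
    ("engine_noise", "ignore", -10.0),
    ("engine_noise", "repair", -3.0),
    ("engine_noise", "ignore", -10.0),
]

for situation, action, outcome in experience:
    update_value(values, situation, action, outcome)

# Pick the action with the highest current estimate in this situation.
best_action = max(
    ("ignore", "repair"),
    key=lambda action: values[("engine_noise", action)],
)
```

No breakdown probability or repair cost is ever estimated explicitly here: only the table of values changes, one entry at a time, as experience accumulates.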
Our presentation relies heavily on Sutton and Barto’s book [SUT 98]. Before presenting the main concepts of reinforcement learning, however, we give a brief overview of the successive stages of research that led to the current formal understanding of the field from the computer science viewpoint.
Most reinforcement learning methods rely on simple principles coming from the study of animal or human cognition, such as the increased tendency to perform an action in a context ...