2.1 Introduction
2.2 -Armed Bandit Problem
2.3 The Learning Structure
2.4 The Value Function
2.5 The Optimal Value Functions
2.6 Markov Decision Processes
2.7 Learning Value Functions
2.8 Policy Iteration
2.9 Temporal Difference Learning
2.10 TD Learning of the State-Action Function
2.11 Q-Learning
2.12 Eligibility Traces
References