Chapter 6

Online Resolution Techniques

6.1. Introduction

We have seen in previous chapters how to approximately solve large MDPs using various techniques based on a parameterized or structured representation of policies and/or value functions, and on the use of simulation for reinforcement learning (RL) techniques. The optimization process returns an approximate optimal policy π̂ that is valid for the whole state space. For very large MDPs, obtaining a good approximation is often difficult, all the more so because, in general, we do not know how to precisely quantify the policy's sub-optimality a priori. A possible improvement then consists of treating these optimization methods as an offline pre-computation. During a second, online phase, the a priori policy is improved by a non-elementary computation for each encountered state.

6.1.1. Exploiting time online

In the framework of MDPs, the algorithm used to determine the current action online is generally very simple. For example, when π̂ is defined through a value function V̂, this algorithm is a simple comparison of the actions' values "at one step" (see Algorithm 1.3 in Chapter 1). Similarly, in the case of a parameterized ...
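As a minimal sketch of this one-step comparison, the greedy action at the current state can be obtained by backing up the value function over each action's successor states. All names here (P, R, V, gamma) are illustrative assumptions, not the book's notation:

```python
def greedy_action(s, actions, P, R, V, gamma=0.95):
    """One-step greedy choice: pick the action maximizing the Bellman backup at s.

    P(s, a) -> list of (next_state, probability) pairs (assumed interface)
    R(s, a, s2) -> immediate reward (assumed interface)
    V -> mapping from states to their (approximate) values
    """
    def q(a):
        # Expected one-step return of action a under the value function V
        return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P(s, a))
    return max(actions, key=q)


# Toy two-state example: action 'a' moves to state 1 (value 10), 'b' stays put.
V = {0: 0.0, 1: 10.0}
P = lambda s, a: [(1, 1.0)] if a == 'a' else [(s, 1.0)]
R = lambda s, a, s2: 0.0
print(greedy_action(0, ['a', 'b'], P, R, V, gamma=0.9))  # -> 'a'
```

The online computation per state is a single maximization over actions, which is why this baseline is considered elementary; the techniques of this chapter replace it with deeper lookahead.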
