Online Resolution Techniques
We have seen in previous chapters how to approximately solve large MDPs using various techniques based on a parameterized or structured representation of policies and/or value functions, and on the use of simulation for reinforcement learning (RL) techniques. The optimization process returns an approximately optimal policy that is valid for the whole state space. For very large MDPs, obtaining a good approximation is often difficult, all the more so because, in general, we do not know how to precisely quantify the policy’s sub-optimality a priori. A possible improvement is to treat these optimization methods as an offline pre-computation. During a second, online phase, the a priori policy is then improved by a non-elementary computation for each encountered state.
6.1.1. Exploiting time online
In the framework of MDPs, the algorithm used to determine the current action online is generally very simple. For example, when the policy is defined through a value function, this algorithm is a simple comparison of the actions’ values “at one step” (see Algorithm 1.3 in Chapter 1). Similarly, in the case of a parameterized ...
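The one-step comparison of action values mentioned above can be sketched as follows. This is a minimal illustration rather than a reproduction of Algorithm 1.3: the tabular encoding of the model (`transitions`), the dictionary `value`, the discount factor `gamma`, and the function name `one_step_greedy_action` are all assumptions made for the example.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

State = Hashable
Action = Hashable
# Hypothetical model encoding: transitions[(s, a)] lists the possible
# outcomes of taking action a in state s as (probability, next_state, reward).
Transitions = Dict[Tuple[State, Action], List[Tuple[float, State, float]]]


def one_step_greedy_action(
    state: State,
    actions: Sequence[Action],
    transitions: Transitions,
    value: Dict[State, float],
    gamma: float = 0.95,
) -> Action:
    """Return the action with the best one-step lookahead value in `state`."""

    def q(action: Action) -> float:
        # Expected immediate reward plus discounted value of the successor.
        return sum(
            p * (r + gamma * value[next_state])
            for p, next_state, r in transitions[(state, action)]
        )

    return max(actions, key=q)


# Toy model (illustrative numbers only).
toy_transitions: Transitions = {
    ("s0", "left"): [(1.0, "s1", 0.0)],
    ("s0", "right"): [(0.5, "s1", 1.0), (0.5, "s2", 0.0)],
}
toy_value = {"s1": 1.0, "s2": 4.0}
chosen = one_step_greedy_action("s0", ["left", "right"], toy_transitions, toy_value)
```

The online cost of this rule is a single sweep over the actions and their successors, which is why it leaves most of the available decision time unused when the value function is only approximate.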