Policy-Gradient Algorithms
Most MDP solution methods evaluate a value function, which makes it possible to identify optimal actions in each state. A first difficulty for these approaches is handling large-scale problems. As shown in Chapter 3, one solution is to use methods that approximate the value function. However, monotonic policy improvement is then no longer guaranteed for algorithms such as value iteration or policy iteration. A second difficulty is handling partial observations, a setting in which the Markov property does not necessarily hold.
A very different approach is to perform a direct policy search, with the policy taking the form of a parameterized controller. The choice of this controller makes it possible to adapt to the type of problem at hand and to trade off between expressing a large variety of policies and limiting memory usage. We end up with a parameter optimization problem for which algorithms exist that guarantee a monotonic improvement of the policy towards a local optimum, in some cases even under partial observability.
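To make the idea of a parameterized controller concrete, here is a minimal sketch of direct policy search in Python. It uses a softmax controller over discrete actions, whose parameters are updated by a REINFORCE-style gradient ascent step; the state/action sizes, step size, and episode format are illustrative assumptions, not taken from the text.

```python
import math

# Hypothetical parameterized controller: theta[s][a] scores action a in
# state s, and the policy is the softmax pi(a|s) ∝ exp(theta[s][a]).
def policy(theta, s):
    """Return the action probabilities pi(.|s) of the softmax controller."""
    z = [math.exp(t) for t in theta[s]]
    total = sum(z)
    return [p / total for p in z]

def reinforce_step(theta, episode, alpha=0.1):
    """One REINFORCE-style gradient ascent step on the parameters.

    episode: list of (state, action, return-to-go) triples.
    For a softmax policy, d log pi(a|s) / d theta[s][b] = 1{b==a} - pi(b|s),
    so each parameter moves in the direction of the return-weighted
    log-likelihood gradient.
    """
    for s, a, G in episode:
        probs = policy(theta, s)
        for b in range(len(theta[s])):
            grad = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += alpha * G * grad
    return theta

# Illustrative usage: one state, two actions, a single observed transition
# with positive return; the update increases the chosen action's probability.
theta = [[0.0, 0.0]]
reinforce_step(theta, [(0, 1, 1.0)], alpha=0.5)
```

This is the compromise mentioned above in miniature: a small parameter table limits memory, while the softmax form still expresses any stochastic stationary policy over the discrete actions.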
EXAMPLE 5.1. What about the cars that need maintenance, then? A garage mechanic often knows the procedures (or algorithms) to follow to diagnose and repair a vehicle. But these procedures may vary depending on the model or the age of the vehicle. For example, it can be more cost-effective to replace the brake discs without testing them once their age exceeds a given limit. The methods we will discuss here aim at optimizing/learning ...