Based on the preceding policy improvement bound, the following algorithm is developed:
Initialize the policy.
Repeat for each step:
  Compute all advantage values.
  Solve the following optimization problem:
Until convergence.
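The loop above can be sketched on a small tabular problem. This is a minimal illustration under stated assumptions, not the source's implementation. It assumes the preceding bound has the form η(π') ≤ L_π(π') + C · D_KL^max(π, π'), with C = 4εγ/(1−γ)² and ε = max |A_π(s, a)|, and it treats η as an expected discounted cost. The exact optimization is replaced by a search over mixtures of the current policy and a greedy policy. The random MDP, the function names and the step grid are all hypothetical.

```python
import numpy as np

def evaluate(P, c, pi, gamma):
    """Exact values of policy pi on a tabular MDP with costs c[s, a]."""
    n_s = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    c_pi = np.einsum("sa,sa->s", pi, c)     # expected one-step cost under pi
    V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, c_pi)
    Q = c + gamma * (P @ V)                 # Q[s, a]
    return V, Q, P_pi

def penalized_step(P, c, mu, pi, gamma, alphas):
    """One iteration: compute advantages, then minimize the penalized surrogate.

    Assumption: the search is restricted to mixtures of pi and a greedy policy,
    which stands in for the full optimization problem.
    """
    V, Q, P_pi = evaluate(P, c, pi, gamma)
    A = Q - V[:, None]                      # advantage values A_pi(s, a)
    # Unnormalized discounted state visitation from start distribution mu.
    rho = np.linalg.solve((np.eye(len(mu)) - gamma * P_pi).T, mu)
    eta = mu @ V                            # expected discounted cost of pi
    C = 4.0 * np.abs(A).max() * gamma / (1.0 - gamma) ** 2
    greedy = np.eye(c.shape[1])[A.argmin(axis=1)]  # lowest-advantage action per state

    best, best_obj = pi, eta                # alpha = 0: surrogate = eta, KL = 0
    for alpha in alphas:
        cand = (1.0 - alpha) * pi + alpha * greedy
        surrogate = eta + rho @ np.sum(cand * A, axis=1)
        kl_max = np.max(np.sum(pi * np.log(pi / cand), axis=1))
        obj = surrogate + C * kl_max        # assumed upper bound on the cost of cand
        if obj < best_obj:
            best, best_obj = cand, obj
    return best, eta

# Hypothetical random MDP: 4 states, 3 actions.
rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s']
c = rng.uniform(size=(n_s, n_a))
mu = np.full(n_s, 1.0 / n_s)
pi = np.full((n_s, n_a), 1.0 / n_a)                # uniform initial policy
alphas = np.logspace(-6, -0.01, 60)                # mixture weights, all below 1

costs = []
for _ in range(20):
    pi, eta = penalized_step(P, c, mu, pi, gamma, alphas)
    costs.append(eta)                              # cost of the policy before the update
```

Because the candidate set always contains the current policy, for which the penalized objective equals its own cost, the recorded costs should never increase. With a constant C this large the steps are very small, so the sketch illustrates the guarantee rather than a practical method.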
In each step, the algorithm minimizes the upper bound given by the preceding policy improvement bound, so that:
The last equation follows from the fact that for any ...
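The monotonicity argument can be written out under assumed notation, since the symbols are not given in the text above. Here η(π) is taken to be the expected discounted cost of π, L_{π_i} the surrogate objective built from the advantages of π_i, and C the penalty constant from the bound. These names are assumptions for illustration and may differ from the source's notation.

```latex
% Assumed notation: \eta = expected discounted cost, L_{\pi_i} = surrogate objective,
% C = penalty constant, D_{\mathrm{KL}}^{\max} = maximum KL divergence over states.
M_i(\pi) = L_{\pi_i}(\pi) + C \, D_{\mathrm{KL}}^{\max}(\pi_i, \pi),
\qquad
\pi_{i+1} = \arg\min_{\pi} M_i(\pi)

\eta(\pi_{i+1}) \;\le\; M_i(\pi_{i+1}) \;\le\; M_i(\pi_i) \;=\; \eta(\pi_i)
```

The first inequality is the assumed bound, and the second holds because π_{i+1} minimizes M_i. The final equality would hold if the surrogate of a policy evaluated at itself equals its cost, L_{π_i}(π_i) = η(π_i), and the KL divergence of a policy from itself is zero.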