September 2018
Intermediate to advanced
296 pages
9h 10m
English
Based on the preceding policy improvement bound, the following algorithm is developed:
Initialize the policy.
Repeat for each step:
    Compute all advantage values.
    Solve the following optimization problem:
Until convergence.
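The "compute all advantage values" step can be illustrated on a small tabular problem. The following is a minimal sketch, not the book's implementation: the two-state MDP, the names `P`, `R`, `pi`, `evaluate_policy`, and `advantage_values`, the discount factor of 0.9, and the use of iterative policy evaluation are all assumptions made for illustration. `R[s][a]` stands for whatever per-step signal (reward or cost) the chapter uses, and the advantage is taken as A(s, a) = Q(s, a) - V(s).

```python
def evaluate_policy(P, R, pi, gamma=0.9, tol=1e-10, max_iters=100_000):
    """Iteratively evaluate V under policy pi for a tabular MDP.

    P[s][a] is a list of next-state probabilities, R[s][a] is the
    per-step signal, and pi[s][a] is the probability of action a in s.
    """
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(max_iters):
        # One synchronous Bellman backup under the fixed policy pi.
        new_V = [
            sum(
                pi[s][a] * (R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V)))
                for a in range(len(pi[s]))
            )
            for s in range(n_states)
        ]
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < tol:
            break
    return V


def advantage_values(P, R, pi, gamma=0.9):
    """Return A[s][a] = Q(s, a) - V(s) for every state-action pair."""
    V = evaluate_policy(P, R, pi, gamma)
    A = []
    for s in range(len(P)):
        row = []
        for a in range(len(P[s])):
            q = R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
            row.append(q - V[s])
        A.append(row)
    return A


# Hypothetical two-state, two-action MDP used only for illustration.
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
R = [[1.0, 0.0], [0.5, 2.0]]
pi = [[0.5, 0.5], [0.3, 0.7]]

A = advantage_values(P, R, pi)
for s, row in enumerate(A):
    print(s, [round(x, 4) for x in row])
```

A useful sanity check is that the advantages of a policy, averaged under that same policy, are zero in every state, because V(s) is the policy-weighted average of Q(s, a).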
In each step, this algorithm minimizes the upper bound given by the preceding policy improvement bound, so that:
The last equation follows from the fact that, for any ...