In the vanilla policy gradient approach, the aim is to update the policy using a policy gradient estimate, with a baseline to reduce the variance of that estimate.
The following is the pseudocode for the vanilla policy gradient method for finding the optimal policy:
Initialize: policy parameter θ and baseline b
for iteration = 1, 2, ..., N do
    Collect a set of trajectories using the current policy
    At each time step t in each trajectory, compute the following:
        the return R_t = r_t + γ*r_{t+1} + γ²*r_{t+2} + ..., and
        the advantage estimate A_t = R_t - b(s_t)
    Refit the baseline function b so that b(s_t) approximates R_t over all trajectories and time steps
    Update the policy using the policy gradient estimate
end for
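As a rough illustration of this loop, the sketch below implements it with NumPy using a tabular softmax policy, a per-state baseline, and a toy chain environment. The environment, the policy parameterization, and all hyperparameters are assumptions made for the example, not taken from the text, so treat it as a minimal sketch rather than a definitive implementation.

# A minimal sketch of the vanilla policy gradient loop above, using NumPy.
# The toy chain environment, the tabular softmax policy, and the hyperparameters
# are illustrative assumptions, not taken from the book.
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS, HORIZON, GAMMA = 4, 2, 10, 0.99
ALPHA, N_TRAJ, N_ITER = 0.1, 10, 300

def env_step(state, action):
    # Toy chain MDP: action 1 moves right (reward 1 on reaching the last state),
    # action 0 resets the agent to the first state.
    if action == 1:
        next_state = min(state + 1, N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
    else:
        next_state, reward = 0, 0.0
    return next_state, reward

def policy_probs(theta, state):
    # Softmax policy over actions for a tabular parameter theta[state, action].
    logits = theta[state]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

theta = np.zeros((N_STATES, N_ACTIONS))   # policy parameters
baseline = np.zeros(N_STATES)             # baseline b(s)

for iteration in range(N_ITER):
    # Collect a set of trajectories using the current policy
    trajectories = []
    for _ in range(N_TRAJ):
        state, traj = 0, []
        for _ in range(HORIZON):
            action = rng.choice(N_ACTIONS, p=policy_probs(theta, state))
            next_state, reward = env_step(state, action)
            traj.append((state, action, reward))
            state = next_state
        trajectories.append(traj)

    policy_grad = np.zeros_like(theta)
    return_samples = [[] for _ in range(N_STATES)]

    for traj in trajectories:
        # Returns R_t, accumulated backwards: R_t = r_t + gamma * R_{t+1}
        R, returns = 0.0, np.zeros(len(traj))
        for t in reversed(range(len(traj))):
            R = traj[t][2] + GAMMA * R
            returns[t] = R
        for t, (s, a, _) in enumerate(traj):
            advantage = returns[t] - baseline[s]   # A_t = R_t - b(s_t)
            grad_log = -policy_probs(theta, s)     # gradient of log softmax w.r.t. logits
            grad_log[a] += 1.0
            policy_grad[s] += grad_log * advantage
            return_samples[s].append(returns[t])

    # Update the policy using the policy gradient estimate
    theta += ALPHA * policy_grad / N_TRAJ
    # Refit the baseline: set b(s) to the mean observed return from state s
    for s in range(N_STATES):
        if return_samples[s]:
            baseline[s] = float(np.mean(return_samples[s]))

Refitting the baseline to the mean return observed from each state is one simple choice; in practice the baseline is more often a learned value function trained to predict R_t.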