April 2018
Intermediate to advanced
334 pages
10h 18m
English
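In the Monte Carlo policy gradient approach, we update the parameters $\theta$ by stochastic gradient ascent, using the update given by the policy gradient theorem, $\Delta\theta_t = \alpha \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, v_t$, with the return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$. Here, $v_t$ is the cumulative reward from that time-step onward and $\alpha$ is the step size.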
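To make the cumulative reward from each time-step onward concrete, here is a minimal sketch that computes it for every step of one recorded episode. The function name `returns_to_go` and the `gamma` discount parameter are assumptions added for illustration; with the default `gamma=1.0` the result is the plain undiscounted sum described in the text.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return, for every step t, the sum of rewards from t to the end of the episode.

    gamma is a hypothetical discount factor; 1.0 means no discounting.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(returns_to_go([1.0, 0.0, 2.0]))  # [3.0, 2.0, 2.0]
```

Computing the returns in a single backward pass keeps the cost linear in the episode length, instead of re-summing the tail of the reward list at every step.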
The Monte Carlo policy gradient approach is as follows:
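Initialize $\theta$ arbitrarily
for each episode generated as per the current policy $\pi_\theta$ do
    for step t = 1 to T-1 do
        $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, v_t$
    end for
end for ...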
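The loop above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the book's implementation: the four-state corridor environment, its rewards, the tabular softmax policy, the zero initialization, and all names (`step`, `run_episode`, `reinforce`) are hypothetical choices made so the example runs without any library.

```python
import math
import random

# Hypothetical toy corridor: states 0..3, start at 0, episode ends at state 3.
N_STATES, N_ACTIONS, MAX_STEPS = 4, 2, 20


def softmax(prefs):
    """Turn action preferences into probabilities (shifted for numerical stability)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Action 1 moves right, action 0 moves left; reaching the last state pays 1."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else -0.1), done


def run_episode(theta, rng):
    """Sample one episode as per the current policy; returns (state, action, reward) triples."""
    state, trajectory = 0, []
    for _ in range(MAX_STEPS):
        probs = softmax(theta[state])
        action = rng.choices(range(N_ACTIONS), weights=probs)[0]
        nxt, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = nxt
        if done:
            break
    return trajectory


def reinforce(episodes=2000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    # "Arbitrary" initialization: all-zero preferences, i.e. a uniform policy.
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        trajectory = run_episode(theta, rng)
        # v_t: undiscounted cumulative reward from step t onward.
        v, returns = 0.0, []
        for _, _, r in reversed(trajectory):
            v += r
            returns.append(v)
        returns.reverse()
        for (s, a, _), v_t in zip(trajectory, returns):
            probs = softmax(theta[s])
            for b in range(N_ACTIONS):
                # For a softmax policy, d/d theta[s][b] of log pi(a|s) is 1[a == b] - pi(b|s).
                grad_log = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha * grad_log * v_t
    return theta


theta = reinforce()
# Probability of moving right in each non-terminal state after training.
print([round(softmax(row)[1], 2) for row in theta[:-1]])
```

Because the returns are used without a baseline, the updates are noisy; on a problem this small the policy should still come to prefer moving right, but the exact probabilities depend on the seed and step size.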