Policy gradient ascent

The basic intuition behind policy gradient (PG) methods is that we move from finding a value function that implies a deterministic policy to learning a stochastic policy directly, with parameters that define a probability distribution over actions. Thinking this way, we define our policy, π, in terms of parameters θ, so that adjusting θ changes the probability of taking a given action in a given state. Mathematically, we can define this like so:

$$\pi_\theta(a \mid s) = P(a \mid s, \theta)$$
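
To make the idea of a parameterized stochastic policy concrete, here is a minimal Python sketch. It assumes a softmax over linear action preferences, which is only one common choice and not necessarily the form used later in this chapter. The names softmax_policy, sample_action, theta, and the two-element feature vector are all hypothetical and chosen purely for illustration.

```python
import math
import random


def softmax_policy(theta, state_features):
    """Return the probability of each action in a state, given parameters theta.

    Assumption: theta holds one weight vector per action, and the preference
    for an action is the dot product of its weights with the state features.
    """
    prefs = [
        sum(w * x for w, x in zip(weights, state_features))
        for weights in theta
    ]
    # Subtract the largest preference before exponentiating for numerical stability.
    max_pref = max(prefs)
    exps = [math.exp(p - max_pref) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw an action index according to the policy's probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical example: two actions, two state features.
state = [1.0, 0.5]
theta = [[0.0, 0.0], [0.0, 0.0]]
probs = softmax_policy(theta, state)    # all-zero parameters: both actions equally likely

theta[1] = [1.0, 2.0]                   # adjust the parameters for action 1
shifted = softmax_policy(theta, state)  # action 1 is now more probable than action 0

action = sample_action(shifted, random.Random(0))
```

The point of the sketch is that the policy never commits to a single action for a state. Changing theta reshapes the distribution, and the action actually taken is sampled from it.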

You should consider the mathematics we cover in this chapter to be the minimum you need in order to understand the code. If you are indeed serious about developing your own extensions ...

