Appendix A. The Gradient of a Logistic Policy for Two Actions

Equation 5-6 is a policy for two actions. To update the policy I need to calculate the gradient of the natural logarithm of the policy (see Equation 5-4). I present this in Equation A-1. You can perform the differentiation in a few different ways depending on how you refactor the expression, so the intermediate algebra can look different even though it leads to the same result.

Equation A-1. Logistic policy gradient for two actions
$$\nabla_\theta \ln \pi(a \mid s, \theta) = \begin{bmatrix} \dfrac{\partial}{\partial \theta_0} \ln \dfrac{1}{1 + e^{-\theta_0 s}} \\[2ex] \dfrac{\partial}{\partial \theta_1} \ln \left( 1 - \dfrac{1}{1 + e^{-\theta_1 s}} \right) \end{bmatrix}$$
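Before differentiating, it can help to see the policy itself in code. The sketch below is a minimal plain-Python rendering of the two-action logistic policy assumed by Equation A-1; the function names are my own, not from the book. With a shared parameter, the two action probabilities are complementary and sum to one.

```python
import math

def pi_action0(theta0, s):
    # Probability of action 0: the logistic function (form taken from Equation A-1).
    return 1.0 / (1.0 + math.exp(-theta0 * s))

def pi_action1(theta1, s):
    # Probability of action 1: one minus the logistic (Equation A-1).
    return 1.0 - 1.0 / (1.0 + math.exp(-theta1 * s))

def log_pi(action, s, theta):
    # Log-probability of the chosen action, the quantity whose gradient
    # the policy-gradient update (Equation 5-4) needs.
    theta0, theta1 = theta
    if action == 0:
        return math.log(pi_action0(theta0, s))
    return math.log(pi_action1(theta1, s))

# When both actions share the same parameter, the probabilities sum to one.
theta = (0.5, 0.5)
s = 2.0
total = pi_action0(theta[0], s) + pi_action1(theta[1], s)
```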

I calculate the gradient for each action independently, and I find it easier to first refactor the logistic function as in Equation A-2.

Equation A-2. Refactoring the logistic function
$$\pi(x) \triangleq \frac{1}{1+e^{-x}} = \frac{e^x}{e^x\left(1+e^{-x}\right)} = \frac{e^x}{e^x + e^x e^{-x}} = \frac{e^x}{e^x + e^{x-x}} = \frac{e^x}{e^x + e^0} = \frac{e^x}{e^x + 1} = \frac{e^x}{1+e^x}$$
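The refactoring is pure algebra, so the two forms should agree for any input. A quick numeric sketch (plain Python, helper names are mine) confirms this:

```python
import math

def logistic(x):
    # Original form: 1 / (1 + e^{-x}).
    return 1.0 / (1.0 + math.exp(-x))

def logistic_refactored(x):
    # Refactored form from Equation A-2: e^x / (1 + e^x).
    return math.exp(x) / (1.0 + math.exp(x))

# The two forms match to floating-point precision across a range of inputs.
diffs = [abs(logistic(x) - logistic_refactored(x))
         for x in (-5.0, -1.0, 0.0, 1.0, 5.0)]
```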

The derivative of the refactored logistic function, for action 0, is shown in Equation A-3.

Equation A-3. Differentiation of action 0
$$\begin{aligned}
\frac{\partial}{\partial \theta_0} \ln \pi_0(\theta_0 s)
&= \frac{\partial}{\partial \theta_0} \ln \frac{e^{\theta_0 s}}{1+e^{\theta_0 s}} \\
&= \frac{\partial}{\partial \theta_0} \ln e^{\theta_0 s} - \frac{\partial}{\partial \theta_0} \ln\left(1+e^{\theta_0 s}\right) \\
&= \frac{\partial}{\partial \theta_0} \theta_0 s - \frac{\partial}{\partial \theta_0} \ln\left(1+e^{\theta_0 s}\right) \\
&= s - \frac{\partial}{\partial \theta_0} \ln\left(1+e^{\theta_0 s}\right) \\
&= s - \frac{\partial}{\partial \theta_0} \ln u, \quad \text{where } u = 1+e^{\theta_0 s} \\
&= s - \frac{1}{u}\frac{\partial u}{\partial \theta_0} \\
&= s - \frac{1}{1+e^{\theta_0 s}} \frac{\partial}{\partial \theta_0}\left(1+e^{\theta_0 s}\right) \\
&= s - \frac{s\,e^{\theta_0 s}}{1+e^{\theta_0 s}} \\
&= s\left(1 - \pi(\theta_0 s)\right)
\end{aligned}$$
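Carrying the chain rule through, the gradient for action 0 reduces to $s\left(1 - \pi(\theta_0 s)\right)$. A minimal sketch (plain Python; the helper names are mine) checks this closed form against a central finite difference of $\ln \pi_0$:

```python
import math

def logistic(x):
    # The logistic function pi(x) = 1 / (1 + e^{-x}).
    return 1.0 / (1.0 + math.exp(-x))

def grad_log_pi0(theta0, s):
    # Closed-form gradient of ln pi_0 from Equation A-3: s * (1 - pi(theta0 * s)).
    return s * (1.0 - logistic(theta0 * s))

def grad_log_pi0_numeric(theta0, s, eps=1e-6):
    # Central finite difference of ln pi_0 as an independent check.
    f = lambda t: math.log(logistic(t * s))
    return (f(theta0 + eps) - f(theta0 - eps)) / (2.0 * eps)

analytic = grad_log_pi0(0.5, 2.0)
numeric = grad_log_pi0_numeric(0.5, 2.0)
```

The finite difference agrees with the analytic gradient to several decimal places, which is a useful sanity check whenever you derive a gradient by hand.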
