Appendix A. The Gradient of a Logistic Policy for Two Actions

Equation 5-6 is a policy for two actions. To update the policy I need to calculate the gradient of the natural logarithm of the policy (see Equation 5-4). I present this in Equation A-1. You can perform the differentiation in a few different ways depending on how you refactor the expression, so the intermediate algebra can look different even though it leads to the same result.

Equation A-1. Logistic policy gradient for two actions
$$\nabla_\theta \ln \pi(a \mid s, \theta) = \begin{bmatrix} \dfrac{\partial}{\partial \theta_0} \ln \dfrac{1}{1 + e^{-\theta_0 s}} \\[2ex] \dfrac{\partial}{\partial \theta_1} \ln \left( 1 - \dfrac{1}{1 + e^{-\theta_1 s}} \right) \end{bmatrix}$$
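Before differentiating, it can help to see the policy itself in code. The sketch below is a minimal plain-Python rendering of the two-action logistic policy assumed by Equation A-1; the function names are my own, not from the book. With a shared parameter, the two action probabilities are complementary and sum to one.

```python
import math

def pi_action0(theta0, s):
    # Probability of action 0: the logistic function (form taken from Equation A-1).
    return 1.0 / (1.0 + math.exp(-theta0 * s))

def pi_action1(theta1, s):
    # Probability of action 1: one minus the logistic (Equation A-1).
    return 1.0 - 1.0 / (1.0 + math.exp(-theta1 * s))

def log_pi(action, s, theta):
    # Log-probability of the chosen action, the quantity whose gradient
    # the policy-gradient update (Equation 5-4) needs.
    theta0, theta1 = theta
    if action == 0:
        return math.log(pi_action0(theta0, s))
    return math.log(pi_action1(theta1, s))

# When both actions share the same parameter, the probabilities sum to one.
theta = (0.5, 0.5)
s = 2.0
total = pi_action0(theta[0], s) + pi_action1(theta[1], s)
```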

I calculate the gradient for each action independently, and I find it easier to first refactor the logistic function as in Equation A-2.

Equation A-2. Refactoring the logistic function
$$\pi(x) \triangleq \frac{1}{1+e^{-x}} = \frac{e^x}{e^x\left(1+e^{-x}\right)} = \frac{e^x}{e^x + e^x e^{-x}} = \frac{e^x}{e^x + e^{x-x}} = \frac{e^x}{e^x + e^0} = \frac{e^x}{e^x + 1} = \frac{e^x}{1+e^x}$$
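The refactoring is pure algebra, so the two forms should agree for any input. A quick numeric sketch (plain Python, helper names are mine) confirms this:

```python
import math

def logistic(x):
    # Original form: 1 / (1 + e^{-x}).
    return 1.0 / (1.0 + math.exp(-x))

def logistic_refactored(x):
    # Refactored form from Equation A-2: e^x / (1 + e^x).
    return math.exp(x) / (1.0 + math.exp(x))

# The two forms match to floating-point precision across a range of inputs.
diffs = [abs(logistic(x) - logistic_refactored(x))
         for x in (-5.0, -1.0, 0.0, 1.0, 5.0)]
```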

The derivative of the refactored logistic function, for action 0, is shown in Equation A-3.

Equation A-3. Differentiation of action 0
$$\begin{aligned}
\frac{\partial}{\partial \theta_0} \ln \pi_0(\theta_0 s)
&= \frac{\partial}{\partial \theta_0} \ln \frac{e^{\theta_0 s}}{1+e^{\theta_0 s}} \\
&= \frac{\partial}{\partial \theta_0} \ln e^{\theta_0 s} - \frac{\partial}{\partial \theta_0} \ln\left(1+e^{\theta_0 s}\right) \\
&= \frac{\partial}{\partial \theta_0} \theta_0 s - \frac{\partial}{\partial \theta_0} \ln\left(1+e^{\theta_0 s}\right) \\
&= s - \frac{\partial}{\partial \theta_0} \ln\left(1+e^{\theta_0 s}\right) \\
&= s - \frac{\partial}{\partial \theta_0} \ln u, \quad \text{where } u = 1+e^{\theta_0 s} \\
&= s - \frac{1}{u}\frac{\partial u}{\partial \theta_0} \\
&= s - \frac{1}{1+e^{\theta_0 s}} \frac{\partial}{\partial \theta_0}\left(1+e^{\theta_0 s}\right) \\
&= s - \frac{s\,e^{\theta_0 s}}{1+e^{\theta_0 s}} \\
&= s\left(1 - \pi(\theta_0 s)\right)
\end{aligned}$$
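Carrying the chain rule through, the gradient for action 0 reduces to $s\left(1 - \pi(\theta_0 s)\right)$. A minimal sketch (plain Python; the helper names are mine) checks this closed form against a central finite difference of $\ln \pi_0$:

```python
import math

def logistic(x):
    # The logistic function pi(x) = 1 / (1 + e^{-x}).
    return 1.0 / (1.0 + math.exp(-x))

def grad_log_pi0(theta0, s):
    # Closed-form gradient of ln pi_0 from Equation A-3: s * (1 - pi(theta0 * s)).
    return s * (1.0 - logistic(theta0 * s))

def grad_log_pi0_numeric(theta0, s, eps=1e-6):
    # Central finite difference of ln pi_0 as an independent check.
    f = lambda t: math.log(logistic(t * s))
    return (f(theta0 + eps) - f(theta0 - eps)) / (2.0 * eps)

analytic = grad_log_pi0(0.5, 2.0)
numeric = grad_log_pi0_numeric(0.5, 2.0)
```

The finite difference agrees with the analytic gradient to several decimal places, which is a useful sanity check whenever you derive a gradient by hand.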
