# Appendix A. The Gradient of a Logistic Policy for Two Actions

Equation 5-6 defines a policy for two actions. To update the policy's parameters I need to calculate the gradient of the natural logarithm of the policy (see Equation 5-4). I present this in Equation A-1. You can perform the differentiation in a few different ways depending on how you refactor the function, so the intermediate steps can look different even though they lead to the same result.

##### Equation A-1. Logistic policy gradient for two actions

$$\nabla \ln \pi(a \mid s, \theta) = \begin{pmatrix} \frac{\delta}{\delta\theta_0} \ln\left(\frac{1}{1+e^{-\theta_0^\intercal s}}\right) \\[6pt] \frac{\delta}{\delta\theta_1} \ln\left(1 - \frac{1}{1+e^{-\theta_1^\intercal s}}\right) \end{pmatrix}$$

I calculate the gradient of each action independently, and I find it easier if I first refactor the logistic function as in Equation A-2.

##### Equation A-2. Refactoring the logistic function

$$\begin{aligned} \pi(x) \doteq \frac{1}{1+e^{-x}} &= \frac{e^x}{e^x\left(1+e^{-x}\right)} \\ &= \frac{e^x}{e^x + e^x e^{-x}} \\ &= \frac{e^x}{e^x + e^{x-x}} \\ &= \frac{e^x}{e^x + e^0} \\ &= \frac{e^x}{e^x + 1} \\ &= \frac{e^x}{1+e^x} \end{aligned}$$

The derivative of the natural logarithm of the refactored logistic function, for action 0, is shown in Equation A-3.
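A quick numeric check (a Python sketch of my own, not from the chapter; the function names are mine) confirms that the original and refactored forms of the logistic function agree:

```python
import math

def sigmoid(x):
    """Logistic function in its original form, 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_refactored(x):
    """Refactored form from Equation A-2, e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))

# The two forms agree to floating-point precision across a range of inputs.
for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    assert abs(sigmoid(x) - sigmoid_refactored(x)) < 1e-12
```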

##### Equation A-3. Differentiation of action 0

$$\begin{aligned} \frac{\delta}{\delta\theta_0} \ln \pi_0\left(\theta_0^\intercal s\right) &= \frac{\delta}{\delta\theta_0} \ln\left(\frac{e^{\theta_0^\intercal s}}{1+e^{\theta_0^\intercal s}}\right) \\ &= \frac{\delta}{\delta\theta_0} \ln e^{\theta_0^\intercal s} - \frac{\delta}{\delta\theta_0} \ln\left(1+e^{\theta_0^\intercal s}\right) \\ &= \frac{\delta}{\delta\theta_0} \theta_0^\intercal s - \frac{\delta}{\delta\theta_0} \ln\left(1+e^{\theta_0^\intercal s}\right) \\ &= s - \frac{\delta}{\delta\theta_0} \ln\left(1+e^{\theta_0^\intercal s}\right) \\ &= s - \frac{\delta}{\delta\theta_0} \ln u \quad \text{where } u = 1+e^{\theta_0^\intercal s} \\ &= s - \frac{1}{u}\frac{\delta}{\delta\theta_0} u \\ &= s - \frac{1}{1+e^{\theta_0^\intercal s}} \frac{\delta}{\delta\theta_0}\left(1+e^{\theta_0^\intercal s}\right) \\ &= s - \frac{e^{\theta_0^\intercal s}}{1+e^{\theta_0^\intercal s}}\, s \\ &= s\left(1 - \pi\left(\theta_0^\intercal s\right)\right) \end{aligned}$$
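Carrying the chain rule through to the end yields the analytic gradient $s\left(1 - \pi\left(\theta_0^\intercal s\right)\right)$. A finite-difference check (a Python sketch of my own; function and variable names are mine) verifies it against a numeric approximation of the derivative:

```python
import numpy as np

def log_pi0(theta, s):
    """ln of the action-0 logistic policy, ln(1 / (1 + e^{-theta^T s}))."""
    return -np.log1p(np.exp(-(theta @ s)))

def grad_log_pi0(theta, s):
    """Analytic gradient from the derivation: s * (1 - pi(theta^T s))."""
    pi = 1.0 / (1.0 + np.exp(-(theta @ s)))
    return s * (1.0 - pi)

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
s = rng.normal(size=3)

# Central finite differences, one component of theta at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(len(theta)):
    d = np.zeros_like(theta)
    d[i] = eps
    numeric[i] = (log_pi0(theta + d, s) - log_pi0(theta - d, s)) / (2 * eps)

# The analytic and numeric gradients agree to finite-difference precision.
assert np.allclose(numeric, grad_log_pi0(theta, s), atol=1e-6)
```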
