# Appendix A. The Gradient of a Logistic Policy for Two Actions

Equation 5-6 is a policy for two actions. To update the policy I need to calculate the gradient of the natural logarithm of the policy (see Equation 5-4). I present this in Equation A-1. You can perform the differentiation in a few different ways depending on how you refactor it, so the final expression can look different even though it is mathematically equivalent.

##### Equation A-1. Logistic policy gradient for two actions
$\nabla \ln \pi(a \mid s, \theta) = \begin{pmatrix} \frac{\delta}{\delta \theta_0} \ln \left( \frac{1}{1 + e^{-\theta_0^{\intercal} s}} \right) \\ \frac{\delta}{\delta \theta_1} \ln \left( 1 - \frac{1}{1 + e^{-\theta_1^{\intercal} s}} \right) \end{pmatrix}$

I calculate the gradient for each action independently, and I find it easier if I refactor the logistic function as in Equation A-2.

##### Equation A-2. Refactoring the logistic function
$\begin{aligned}
\pi(x) \doteq \frac{1}{1 + e^{-x}} &= \frac{e^{x}}{e^{x}\left(1 + e^{-x}\right)} \\
&= \frac{e^{x}}{e^{x} + e^{x} e^{-x}} \\
&= \frac{e^{x}}{e^{x} + e^{x - x}} \\
&= \frac{e^{x}}{e^{x} + e^{0}} \\
&= \frac{e^{x}}{e^{x} + 1} \\
&= \frac{e^{x}}{1 + e^{x}}
\end{aligned}$
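A quick numerical check can make the refactoring in Equation A-2 easier to trust. The following is a minimal sketch, not code from the book: the function names `logistic` and `logistic_refactored` and the sample inputs are my own assumptions for illustration. It evaluates both forms at a few points and confirms that they agree.

```python
import math


def logistic(x: float) -> float:
    """Original form of the logistic function: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_refactored(x: float) -> float:
    """Refactored form from the last row of Equation A-2: e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))


# Hypothetical sample points; any moderate values of x would do.
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert math.isclose(logistic(x), logistic_refactored(x), rel_tol=1e-12)
```

Both forms are algebraically identical, but for very large positive `x` the refactored form can overflow in floating point, so the original form is usually the safer one to evaluate in practice.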

The derivative of the logarithm of the refactored logistic function, for action 0, is shown in Equation A-3.

##### Equation A-3. Differentiation of action 0
$\begin{aligned}
\frac{\delta}{\delta \theta_0} \ln \pi_0(\theta_0^{\intercal} s) &= \frac{\delta}{\delta \theta_0} \ln \left( \frac{e^{\theta_0^{\intercal} s}}{1 + e^{\theta_0^{\intercal} s}} \right) \\
&= \frac{\delta}{\delta \theta_0} \ln e^{\theta_0^{\intercal} s} - \frac{\delta}{\delta \theta_0} \ln \left( 1 + e^{\theta_0^{\intercal} s} \right) \\
&= \frac{\delta}{\delta \theta_0} \theta_0^{\intercal} s - \frac{\delta}{\delta \theta_0} \ln \left( 1 + e^{\theta_0^{\intercal} s} \right) \\
&= s - \frac{\delta}{\delta \theta_0} \ln \left( 1 + e^{\theta_0^{\intercal} s} \right) \\
&= s - \frac{\delta}{\delta \theta_0} \ln u \quad \text{where } u = 1 + e^{\theta_0^{\intercal} s} \\
&= s - \frac{1}{u} \frac{\delta}{\delta \theta_0} u \\
&= s - \frac{1}{1 + e^{\theta_0^{\intercal} s}} \frac{\delta}{\delta \theta_0} \left( 1 + e^{\theta_0^{\intercal} s} \right) \\
&= s - \frac{1}{1 + e^{\theta_0^{\intercal} s}} \frac{\delta}{\delta v} \left( e^{v} \right) \frac{\delta}{\delta \theta_0} v \quad \text{where } v = \theta_0^{\intercal} s \\
&= s - \frac{1}{1 + e^{\theta_0^{\intercal} s}} e^{\theta_0^{\intercal} s} s \\
&= s - \frac{s e^{\theta_0^{\intercal} s}}{1 + e^{\theta_0^{\intercal} s}} \\
&= s - s \pi(\theta_0^{\intercal} s)
\end{aligned}$
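One way to sanity-check the final row of Equation A-3 is to compare it against a finite-difference estimate of the gradient. The sketch below is an illustration under my own assumptions, not the book's implementation: the helper names (`log_pi0`, `grad_log_pi0`, `numerical_grad`) and the values of `theta` and `s` are hypothetical.

```python
import math


def log_pi0(theta, s):
    """ln pi_0 = ln(e^x / (1 + e^x)) with x = theta . s, as in row 1 of Equation A-3."""
    x = sum(t * f for t, f in zip(theta, s))
    return x - math.log(1.0 + math.exp(x))


def grad_log_pi0(theta, s):
    """Closed form from the last row of Equation A-3: s - s * pi(theta . s)."""
    x = sum(t * f for t, f in zip(theta, s))
    pi = 1.0 / (1.0 + math.exp(-x))
    return [f - f * pi for f in s]


def numerical_grad(theta, s, eps=1e-6):
    """Central finite differences, perturbing one component of theta at a time."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((log_pi0(up, s) - log_pi0(down, s)) / (2 * eps))
    return grad


theta = [0.3, -1.2, 0.8]  # hypothetical policy weights
s = [1.0, 0.5, -2.0]      # hypothetical state features

analytic = grad_log_pi0(theta, s)
numeric = numerical_grad(theta, s)

# The two gradients should agree to within finite-difference error.
assert all(abs(a - n) < 1e-6 for a, n in zip(analytic, numeric))
```

If the assertion passes for a handful of random `theta` and `s` values, that is reasonable evidence that the derivation and any code based on it are consistent.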
