Appendix A. The Gradient of a Logistic Policy for Two Actions
Equation 5-6 is a policy for two actions. To update the policy, I need to calculate the gradient of the natural logarithm of the policy (see Equation 5-4), which I present in Equation A-1. You can perform the differentiation in a few different ways depending on how you refactor the policy, so the final expression can look different even though it is mathematically equivalent.
Equation A-1. Logistic policy gradient for two actions
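A quick way to build confidence in a gradient like this is to compare it against finite differences. The sketch below assumes that Equation 5-6 has the standard logistic form with linear features: the probability of action 0 is 1 / (1 + exp(-θᵀs)) and the probability of action 1 is its complement. Under that assumption, the gradient of the log policy works out to s - s·π₀ for action 0 and -s·π₀ for action 1. The function names and the test values of `theta` and `s` are illustrative and not taken from the book.

```python
import numpy as np


def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def policy(theta, s):
    """Assumed two-action logistic policy: [P(a=0), P(a=1)]."""
    p0 = logistic(theta @ s)
    return np.array([p0, 1.0 - p0])


def grad_log_policy(theta, s):
    """Analytic gradient of ln(policy) for each action (one row per action)."""
    p0 = logistic(theta @ s)
    return np.array([s - s * p0, -s * p0])


def numerical_grad_log_policy(theta, s, action, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        upper = np.log(policy(theta + step, s)[action])
        lower = np.log(policy(theta - step, s)[action])
        grad[i] = (upper - lower) / (2 * eps)
    return grad


# Arbitrary illustrative parameters and state features.
theta = np.array([0.5, -0.3, 0.8])
s = np.array([1.0, 2.0, -1.0])

analytic = grad_log_policy(theta, s)
for action in (0, 1):
    numeric = numerical_grad_log_policy(theta, s, action)
    print(action, np.allclose(analytic[action], numeric, atol=1e-6))
```

If the two estimates agree for both actions, the closed-form expressions are at least consistent with the assumed policy, whichever algebraic route was used to derive them.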
I calculate the gradient for each action independently, and I find it easier if I first refactor the logistic function as in Equation A-2.
Equation A-2. Refactoring the logistic function
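One common refactoring of the logistic function multiplies the numerator and denominator by e^x, which makes the logarithm easier to split into a difference. This is a sketch of that identity and may not match the exact form used in the book. Here x stands for θᵀs, which assumes linear features.

```latex
\[
\pi(x) \doteq \frac{1}{1 + e^{-x}} = \frac{e^{x}}{1 + e^{x}},
\qquad
1 - \pi(x) = \frac{1}{1 + e^{x}},
\qquad
x = \theta^{\top} s
\]
```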
The derivative of the refactored logistic function, for action 0, is shown in Equation A-3.
Equation A-3. Differentiation of action 0
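As a sketch of how the differentiation for action 0 can proceed, assume the policy for action 0 is the logistic function of θᵀs, rewritten as e^x / (1 + e^x). Taking the logarithm splits the quotient into two terms, and the chain rule handles the second. The notation below is an assumption chosen to be consistent with a linear logistic policy.

```latex
\begin{align*}
\ln \pi(a = 0 \mid s, \theta)
  &= \ln \frac{e^{\theta^{\top} s}}{1 + e^{\theta^{\top} s}}
   = \theta^{\top} s - \ln\!\left(1 + e^{\theta^{\top} s}\right) \\
\nabla_{\theta} \ln \pi(a = 0 \mid s, \theta)
  &= s - \frac{s \, e^{\theta^{\top} s}}{1 + e^{\theta^{\top} s}}
   = s - \frac{s}{1 + e^{-\theta^{\top} s}}
\end{align*}
```

The same steps applied to 1 - π, which equals 1 / (1 + e^x) under this assumption, give -s / (1 + e^{-θᵀs}) for action 1.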