In this implementation section of the TRPO algorithm, we'll concentrate on the computational graph and the steps required to optimize the policy. We'll leave out the implementation of aspects we covered in the previous chapters (such as the loop that gathers trajectories from the environment, the conjugate gradient algorithm, and the line search algorithm). However, make sure to check out the full code in this book's GitHub repository. The implementation is designed for continuous control tasks.
First, let's create all the placeholders and the two deep neural networks for the policy (the actor) and the value function (the critic):
act_ph = tf.placeholder(shape=(None, act_dim), dtype=tf.float32, name='act')
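To make the setup concrete, here is a minimal sketch of how the remaining placeholders and the two networks could be defined in TensorFlow 1.x. The placeholder names, the hidden-layer sizes, and the Gaussian policy parameterization with a state-independent log standard deviation are assumptions for illustration, not necessarily the exact choices made in the full repository code:

import numpy as np
import tensorflow as tf

# Hypothetical dimensions; in the full code they come from the Gym
# environment's observation and action spaces.
obs_dim, act_dim = 3, 1

# Placeholders for observations, returns, and advantages.
obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32, name='obs')
ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')
adv_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='adv')

# Policy network (actor): a small MLP that outputs the mean of a Gaussian
# over actions, with a state-independent log standard deviation.
with tf.variable_scope('actor_nn'):
    p_means = tf.layers.dense(tf.layers.dense(obs_ph, 64, tf.tanh), act_dim)
    log_std = tf.get_variable('log_std',
                              initializer=-0.5 * np.ones(act_dim, dtype=np.float32))

# Value network (critic): a small MLP that outputs a scalar state-value estimate.
with tf.variable_scope('critic_nn'):
    s_values = tf.squeeze(tf.layers.dense(tf.layers.dense(obs_ph, 64, tf.tanh), 1))

Keeping the log standard deviation as a free variable, rather than an output of the network, is a common choice for continuous-control policy gradients: the exploration noise remains learnable without being coupled to the current observation.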