Let's first see how to build a random policy using a simple fully connected (dense) neural network, which takes the 4 values of an observation as input, uses a hidden layer of 4 neurons, and outputs the probability of action 0; based on this probability, the agent samples the next action, either 0 or 1:
# nn_random_policy.py
import tensorflow as tf
import numpy as np
import gym

env = gym.make("CartPole-v0")
num_inputs = env.observation_space.shape[0]

inputs = tf.placeholder(tf.float32, shape=[None, num_inputs])
hidden = tf.layers.dense(inputs, 4, activation=tf.nn.relu)
outputs = tf.layers.dense(hidden, 1, activation=tf.nn.sigmoid)
action = tf.multinomial(tf.log(tf.concat([outputs, 1-outputs], 1)), 1)

with tf.Session() ...
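To see what the `tf.multinomial` sampling step is doing, here is a minimal pure-NumPy sketch of the same idea: given the network's output `p` (the probability of action 0), draw one action from the categorical distribution `[p, 1 - p]`. The helper name `sample_action` is ours, purely for illustration:

```python
import numpy as np

def sample_action(p_action0, rng):
    # Equivalent of tf.multinomial over log([p, 1-p]):
    # draw one sample from the categorical distribution [p, 1-p],
    # so action 0 comes up with probability p_action0.
    return int(rng.choice(2, p=[p_action0, 1.0 - p_action0]))

rng = np.random.default_rng(0)
actions = [sample_action(0.7, rng) for _ in range(1000)]
# Roughly 70% of the sampled actions should be 0.
```

Note that `tf.multinomial` expects *log* probabilities (logits), which is why the graph above wraps the concatenated probabilities in `tf.log` before sampling.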