In practice, we use an additional temperature parameter (τ), which is annealed over time. This parameter controls the spread of the softmax distribution: at the start of training, when τ is high, all actions are considered almost equally, and by the end of training, when τ is low, the probability mass concentrates on the actions with the highest Q-values, so the policy becomes nearly greedy.
In mathematical terms, the policy can be written as shown in the following formula:

π(a|s) = exp(Q(s, a) / τ) / Σ_a' exp(Q(s, a') / τ)

Here, Q(s, a) is the estimated action value and τ is the temperature parameter.
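To see the effect of annealing τ, consider a simple numerical example (the Q-values below are our own illustrative numbers, not taken from the project):

import numpy as np

q = np.array([1.0, 2.0, 3.0])

# High temperature at the start of training: the distribution is almost uniform.
p_start = np.exp(q / 100.) / np.sum(np.exp(q / 100.))   # roughly [0.33, 0.33, 0.34]

# Low temperature at the end of training: almost all probability mass
# sits on the highest-valued action, so the policy is nearly greedy.
p_end = np.exp(q / 0.1) / np.sum(np.exp(q / 0.1))        # roughly [0.0, 0.0, 1.0]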
The following code shows how this policy is initialized:
class BoltzmannQPolicy(Policy):
    """Implement the Boltzmann Q Policy"""

    def __init__(self, tau=1., clip=(-500., 500.)):
        super(BoltzmannQPolicy, self).__init__()
        self.tau = tau    # temperature of the softmax distribution
        self.clip = clip  # clipping range applied to the scaled Q-values to avoid overflow
    ...
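The tau and clip values stored here are used when the policy converts a vector of Q-values into an action. The following sketch is our own illustration of that step, consistent with the formula above (the helper name boltzmann_select_action is ours and is not part of the library):

import numpy as np

def boltzmann_select_action(q_values, tau=1., clip=(-500., 500.)):
    """Sample an action from the Boltzmann (softmax) distribution over Q-values."""
    q_values = np.asarray(q_values, dtype='float64')
    # Clip the scaled Q-values so that np.exp does not overflow.
    exp_values = np.exp(np.clip(q_values / tau, clip[0], clip[1]))
    probs = exp_values / np.sum(exp_values)
    # Higher-valued actions are more likely, but every action keeps some probability.
    return np.random.choice(len(q_values), p=probs)

action = boltzmann_select_action([1.0, 2.0, 3.0], tau=1.)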