BoltzmannQPolicy
During exploration, we would like to exploit all the information present in the estimated Q-values produced by our network. Boltzmann exploration does this. Instead of always taking either a random or the optimal action, this approach chooses an action with weighted probabilities. To accomplish this, it applies a softmax over the network's value estimates for each action. The action that the agent estimates to be optimal is then the most likely (but not guaranteed) to be chosen. The biggest advantage over the e-greedy algorithm is that information about the likely value of the other actions is also taken into consideration. If there are four actions available to an agent, in e-greedy the three actions estimated to be suboptimal are all treated identically, whereas Boltzmann exploration weighs each of them by its estimated value.
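The idea can be sketched with a softmax over Q-values controlled by a temperature parameter. This is an illustrative sketch, not the exact keras-rl BoltzmannQPolicy implementation; the function name, the `tau` parameter, and the example Q-values are all chosen here for demonstration.

```python
import numpy as np

def boltzmann_action(q_values, tau=1.0, rng=None):
    """Sample an action with probability given by a softmax over Q-values.

    tau (temperature) controls exploration: a high tau gives a
    near-uniform distribution, a low tau approaches greedy selection.
    Illustrative sketch only, not the keras-rl implementation.
    """
    rng = rng or np.random.default_rng()
    q = np.asarray(q_values, dtype=np.float64)
    # Subtract the max before exponentiating for numerical stability.
    exp_q = np.exp((q - q.max()) / tau)
    probs = exp_q / exp_q.sum()
    # Every action keeps a nonzero probability, weighted by its value.
    action = rng.choice(len(q), p=probs)
    return action, probs

# Hypothetical Q-value estimates for four actions.
action, probs = boltzmann_action([1.0, 2.0, 0.5, 1.5], tau=1.0)
```

With these Q-values, the second action (the current best estimate) receives the highest probability, but the other three are still sampled in proportion to how promising they look, unlike e-greedy, which spreads its exploration probability uniformly.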