Chapter 6. UCB – The Upper Confidence Bound Algorithm
Introducing the UCB Algorithm
The algorithms we’ve presented so far have one systematic weakness: they don’t keep track of how much they know about any of the arms available to them. They pay attention only to how much reward they’ve gotten from the arms. This means that they’ll underexplore options whose initial experiences were not rewarding, even though they don’t have enough data to be confident about those arms. We can do better by using an algorithm that pays attention to not only what it knows, but also how much it knows.
The algorithm, UCB, that we’ll present in this chapter does exactly this. Before we describe how the UCB algorithm keeps track of how much it knows, let’s look back at the epsilon-Greedy and Softmax algorithms and take a more abstract perspective on them. Both the epsilon-Greedy algorithm and the Softmax algorithm share the following broad properties:
- The algorithm’s default choice is to select the arm that currently has the highest estimated value.
The algorithm sometimes decides to explore and chooses an option that isn’t the one that currently seems best:
The epsilon-Greedy algorithm explores by selecting from all of the arms completely at random. It makes one of these random exploratory decisions with probability
- The Softmax algorithm explores by randomly selecting from all of the available arms with probabilities that are more-or-less proportional to the estimated value of each of the arms. ...
- The epsilon-Greedy algorithm explores by selecting from all of the arms completely at random. It makes one of these random exploratory decisions with probability