Chapter 24
Multi-Armed Bandits, Gittins Index, and Its Calculation
24.1 Introduction
Multi-armed bandit is a colorful term that refers to the dilemma faced by a gambler playing in a casino with multiple slot machines (which were colloquially called one-armed bandits). What strategy should a gambler use to pick the machine to play next? It is the one for which the posterior mean of winning is the highest and thereby maximizes current expected reward, or the one for which the posterior variance of winning is the highest, and has the potential to maximize the future expected reward. Similar exploration vs exploitation trade-offs arise in various application domains including clinical trials [5], stochastic scheduling [25], sensor management [33], and economics [4].
Clinical trials fit naturally into the framework of multi-armed bandits and have been a motivation for their study since the early work of Thompson [31]. Broadly speaking, there are two approaches to multi-armed bandits. The first, following Bellman [2], aims to maximize the expected total discounted reward over an infinite horizon. The second, following Robbins [27], aims to minimize the rate of regret for the expected total reward over a finite horizon. In some of the literature, the first setup is called geometric discounting while the second is called uniform discounting. For a large time, the multi-armed bandit problem, in both formulations, was considered unsolvable until the ...
Get Methods and Applications of Statistics in Clinical Trials, Volume 2: Planning, Analysis, and Inferential Methods now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.