Chapter 2: Multi-Armed Bandits

When you log on to your favorite social media app, chances are you are seeing one of several versions of the app being tested at that time. When you visit a website, the ads displayed to you are tailored to your profile. On many online shopping platforms, prices are set dynamically. Do you know what all these have in common? They are often modeled as multi-armed bandit (MAB) problems to identify optimal decisions. A MAB problem is a form of reinforcement learning (RL) in which the agent makes decisions over a problem horizon that consists of a single step. The goal is therefore to maximize only the immediate reward, and the consequences of a decision for any subsequent steps are not considered. While this is a simplification ...
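To make the single-step setting concrete, here is a minimal sketch of an ad-selection bandit. Everything in it is an illustrative assumption rather than something taken from this chapter: the click probabilities are made up, the names `pull` and `run_epsilon_greedy` are hypothetical, and epsilon-greedy is used only as one simple way to balance trying arms and exploiting the best-looking one. Each round is a complete episode: the agent picks an arm, observes an immediate 0/1 reward, and nothing carries over except its running estimates.

```python
import random


def pull(arm_probs, arm, rng):
    """One-step episode: show ad `arm`, return 1 on a click, else 0.

    `arm_probs` holds hypothetical click probabilities (an assumption
    for this sketch; a real system would not know them).
    """
    return 1 if rng.random() < arm_probs[arm] else 0


def run_epsilon_greedy(arm_probs, n_rounds=10_000, epsilon=0.1, seed=0):
    """Play `n_rounds` independent single-step rounds with epsilon-greedy."""
    rng = random.Random(seed)
    n_arms = len(arm_probs)
    counts = [0] * n_arms    # how many times each arm was chosen
    values = [0.0] * n_arms  # running average reward per arm
    total_reward = 0

    for _ in range(n_rounds):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest estimated reward.
            arm = max(range(n_arms), key=lambda a: values[a])

        reward = pull(arm_probs, arm, rng)

        # Only the immediate reward matters; update the running average.
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward

    return counts, values, total_reward


if __name__ == "__main__":
    # Three hypothetical ads with made-up click-through rates.
    counts, values, total_reward = run_epsilon_greedy([0.02, 0.05, 0.03])
    print("pulls per arm:", counts)
    print("estimated click rates:", [round(v, 3) for v in values])
    print("total clicks:", total_reward)
```

Note that the loop never tracks a state that changes as a result of the chosen arm. That is the single-step horizon in practice: the only thing linking one round to the next is the agent's own estimate of each arm's reward.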

