A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Bandit Problems (Multi-Armed Bandits)

Research Depth 99 in the knowledge graph ☐ I know this ☆ Set as goal

689prerequisites beneath it

Concentration Inequalities Online Learning and Regret Bounds +2 more→

Core Idea

The multi-armed bandit problem is an online learning setting with partial feedback: after choosing an action (pulling an arm), the learner observes only the reward of the chosen arm, not the rewards of the other arms. This creates the exploration-exploitation dilemma — the learner must try different arms to estimate their rewards (explore) while also pulling the arm it believes is best to accumulate reward (exploit). The UCB (Upper Confidence Bound) algorithm achieves O(sqrt(KT * ln T)) regret over T rounds with K arms by choosing the arm with the highest upper confidence bound on its mean reward. The minimax optimal regret for K-armed bandits is Theta(sqrt(KT)), showing that partial feedback costs a sqrt(K) factor compared to the full-information expert setting.

Explainer

The multi-armed bandit problem, named after a gambler facing a row of slot machines (one-armed bandits), captures a dilemma that pervades sequential decision-making: you must balance exploring options you know little about against exploiting the option you currently believe is best. The partial-feedback structure — you only learn the outcome of actions you take — is what makes bandits fundamentally different from (and harder than) the full-information expert problem.

In the stochastic setting, each arm i has an unknown mean reward mu_i, and pulling arm i yields a random reward drawn from a distribution with mean mu_i. The UCB1 algorithm handles this elegantly through the principle of "optimism in the face of uncertainty." At each round t, it pulls the arm maximizing mean_hat_i + sqrt(2 * ln(t) / n_i), where mean_hat_i is the empirical mean reward of arm i and n_i is the number of times it has been pulled. The confidence bonus sqrt(2 * ln(t) / n_i) is wide for under-explored arms and narrow for well-explored ones. This automatically balances exploration and exploitation: uncertain arms get a generous bonus that ensures they are tried, while well-understood suboptimal arms have their bonus shrink below the best arm's empirical mean.

UCB1 achieves regret O(sqrt(KT * ln T)), which is near-optimal — the minimax lower bound is Theta(sqrt(KT)). The logarithmic factor can be removed with more sophisticated algorithms. In the adversarial setting (where rewards are chosen by an adversary, not drawn from fixed distributions), the EXP3 algorithm extends the Hedge approach using importance-weighted reward estimates. Since the learner only observes the reward of the chosen arm, EXP3 constructs unbiased estimates by dividing the observed reward by the probability of having chosen that arm. These estimates are noisier than full-information observations, which is the fundamental reason adversarial bandit regret is sqrt(KT) rather than the expert setting's sqrt(T ln K).

The bandit framework has enormous practical reach. Clinical trials must balance testing treatments against assigning patients to the best-known treatment. Online advertising must decide which ad to show while learning click-through rates. Recommendation systems must balance showing familiar content against discovering new preferences. In each case, the partial-feedback structure and the exploration-exploitation tradeoff are the core challenges. Thompson sampling (a Bayesian approach that samples from the posterior distribution of each arm's mean) has emerged as a practical favorite, often matching UCB's theoretical guarantees with better empirical performance and more natural uncertainty quantification through the Bayesian prior.

Practice Questions 4 questions

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Linear Regression in Machine Learning → Neural Network Fundamentals → Backpropagation Algorithm → Multilayer Perceptrons (MLPs) → Activation Functions in Neural Networks → Vanishing Gradient Problem → Gradient Descent and Optimization → Convex Optimization Fundamentals → Online Learning and Regret Bounds → Bandit Problems (Multi-Armed Bandits)

Longest path: 100 steps · 689 total prerequisite topics

Prerequisites (4)

Online Learning and Regret Boundshard Concentration Inequalitieshard Expected Valuesoft Bayes' Theoremsoft

Leads To (0)

No topics depend on this one yet.