The multi-armed bandit problem is an online learning setting with partial feedback: after choosing an action (pulling an arm), the learner observes only the reward of the chosen arm, not the rewards of the other arms. This creates the exploration-exploitation dilemma — the learner must try different arms to estimate their rewards (explore) while also pulling the arm it believes is best to accumulate reward (exploit). The UCB (Upper Confidence Bound) algorithm achieves O(sqrt(KT * ln T)) regret over T rounds with K arms by choosing the arm with the highest upper confidence bound on its mean reward. The minimax optimal regret for K-armed bandits is Theta(sqrt(KT)), showing that partial feedback costs a sqrt(K) factor compared to the full-information expert setting.
The multi-armed bandit problem, named after a gambler facing a row of slot machines (one-armed bandits), captures a dilemma that pervades sequential decision-making: you must balance exploring options you know little about against exploiting the option you currently believe is best. The partial-feedback structure — you only learn the outcome of actions you take — is what makes bandits fundamentally different from (and harder than) the full-information expert problem.
In the stochastic setting, each arm i has an unknown mean reward mu_i, and pulling arm i yields a random reward drawn from a distribution with mean mu_i. The UCB1 algorithm handles this elegantly through the principle of "optimism in the face of uncertainty." At each round t, it pulls the arm maximizing mean_hat_i + sqrt(2 * ln(t) / n_i), where mean_hat_i is the empirical mean reward of arm i and n_i is the number of times it has been pulled. The confidence bonus sqrt(2 * ln(t) / n_i) is wide for under-explored arms and narrow for well-explored ones. This automatically balances exploration and exploitation: uncertain arms get a generous bonus that ensures they are tried, while well-understood suboptimal arms have their bonus shrink below the best arm's empirical mean.
UCB1 achieves regret O(sqrt(KT * ln T)), which is near-optimal — the minimax lower bound is Theta(sqrt(KT)). The logarithmic factor can be removed with more sophisticated algorithms. In the adversarial setting (where rewards are chosen by an adversary, not drawn from fixed distributions), the EXP3 algorithm extends the Hedge approach using importance-weighted reward estimates. Since the learner only observes the reward of the chosen arm, EXP3 constructs unbiased estimates by dividing the observed reward by the probability of having chosen that arm. These estimates are noisier than full-information observations, which is the fundamental reason adversarial bandit regret is sqrt(KT) rather than the expert setting's sqrt(T ln K).
The bandit framework has enormous practical reach. Clinical trials must balance testing treatments against assigning patients to the best-known treatment. Online advertising must decide which ad to show while learning click-through rates. Recommendation systems must balance showing familiar content against discovering new preferences. In each case, the partial-feedback structure and the exploration-exploitation tradeoff are the core challenges. Thompson sampling (a Bayesian approach that samples from the posterior distribution of each arm's mean) has emerged as a practical favorite, often matching UCB's theoretical guarantees with better empirical performance and more natural uncertainty quantification through the Bayesian prior.
No topics depend on this one yet.