In the multi-armed bandit setting, why can't you simply use the Hedge algorithm from the full-information expert problem?
AHedge requires continuous loss values, but bandits only produce binary rewards
BHedge updates weights for ALL experts each round, but in bandits you only observe the reward of the chosen arm — you lack the information to update the weights of unchosen arms
CHedge has higher computational complexity than bandit algorithms
DHedge assumes the losses are adversarial, but bandit problems assume stochastic rewards
The fundamental difference between the expert setting and the bandit setting is the information structure. In the expert setting (full information), you observe the loss of every expert each round, allowing you to update all weights. In the bandit setting (partial feedback), you only observe the reward of the arm you pulled. The Hedge update requires knowing all experts' losses — information that is unavailable in the bandit setting. To bridge this gap, bandit algorithms like EXP3 use importance-weighted estimators to construct unbiased estimates of the unseen losses, but these estimates have higher variance, which is why bandit regret is higher (sqrt(KT) vs sqrt(T ln K)).
Question 2 Multiple Choice
The UCB1 algorithm selects the arm with the highest value of (empirical mean + sqrt(2 * ln(t) / n_i)), where n_i is the number of times arm i has been pulled. What does the second term accomplish?
AIt adds random noise to encourage exploration in early rounds
BIt is a confidence bonus that is larger for arms that have been pulled fewer times, ensuring that under-explored arms are given a fair chance — implementing 'optimism in the face of uncertainty'
CIt penalizes arms that have been pulled too often to prevent over-exploitation
DIt corrects for the bias in the empirical mean estimate when sample sizes are small
The sqrt(2 * ln(t) / n_i) term is an upper confidence bound derived from Hoeffding's inequality. For arms that have been pulled many times (large n_i), this term is small — the empirical mean is reliable and exploration is less needed. For rarely-pulled arms (small n_i), the term is large — there is high uncertainty about the arm's true mean, and the algorithm gives it the benefit of the doubt by adding a large bonus. This 'optimism in the face of uncertainty' principle ensures every arm is explored enough to get a reliable estimate, while concentrating pulls on the arm that is genuinely best once estimates become accurate.
Question 3 True / False
In a stochastic K-armed bandit, the minimax optimal regret is Theta(sqrt(KT)), while in the full-information setting with K experts it is Theta(sqrt(T ln K)). The extra sqrt(K/ln K) factor is the price of partial feedback.
TTrue
FFalse
Answer: True
The gap between sqrt(KT) and sqrt(T ln K) is precisely the information cost of the bandit setting. In the full-information setting, you observe all K experts' losses each round, giving K data points per round. In the bandit setting, you observe only 1 reward per round, so after T rounds you have T total observations spread across K arms — roughly T/K per arm. This K-fold information deficit per arm translates to the sqrt(K) factor in the regret. The ln K term in the full-information bound becomes sqrt(K) because you must actually pull each arm to learn about it, rather than passively observing.
Question 4 Short Answer
Explain the exploration-exploitation dilemma in bandits and why it does not arise in the full-information expert setting.
Think about your answer, then reveal below.
Model answer: In bandits, the learner only observes the reward of the chosen arm. To learn about an arm's quality, you must pull it — there is no other way to gain information. But pulling a suboptimal arm incurs opportunity cost (you forgo the reward of the best arm). This creates a dilemma: exploration (pulling arms to learn their rewards) directly conflicts with exploitation (pulling the arm you believe is best). In the full-information expert setting, this dilemma does not exist because you observe all experts' losses every round regardless of which expert you follow. Information about every expert is free — it arrives whether you use that expert or not. The learner never needs to sacrifice reward to gain information, so there is no exploration cost.
The exploration-exploitation dilemma is the defining feature that separates bandits from full-information online learning. Every bandit algorithm must manage this tradeoff, and the different approaches — optimism (UCB), randomization (Thompson sampling), information-theoretic (EXP3) — represent different strategies for balancing information acquisition against reward maximization.