Questions — Bandit Problems (Multi-Armed Bandits)

Question 1 Multiple Choice

In the multi-armed bandit setting, why can't you simply use the Hedge algorithm from the full-information expert problem?

AHedge requires continuous loss values, but bandits only produce binary rewards

BHedge updates weights for ALL experts each round, but in bandits you only observe the reward of the chosen arm — you lack the information to update the weights of unchosen arms

CHedge has higher computational complexity than bandit algorithms

DHedge assumes the losses are adversarial, but bandit problems assume stochastic rewards

Question 2 Multiple Choice

The UCB1 algorithm selects the arm with the highest value of (empirical mean + sqrt(2 * ln(t) / n_i)), where n_i is the number of times arm i has been pulled. What does the second term accomplish?

AIt adds random noise to encourage exploration in early rounds

BIt is a confidence bonus that is larger for arms that have been pulled fewer times, ensuring that under-explored arms are given a fair chance — implementing 'optimism in the face of uncertainty'

CIt penalizes arms that have been pulled too often to prevent over-exploitation

DIt corrects for the bias in the empirical mean estimate when sample sizes are small

Question 3 True / False

In a stochastic K-armed bandit, the minimax optimal regret is Theta(sqrt(KT)), while in the full-information setting with K experts it is Theta(sqrt(T ln K)). The extra sqrt(K/ln K) factor is the price of partial feedback.

TTrue

FFalse

Question 4 Short Answer

Explain the exploration-exploitation dilemma in bandits and why it does not arise in the full-information expert setting.

Think about your answer, then reveal below.

Questions: Bandit Problems (Multi-Armed Bandits)