Questions: Bandit Problems (Multi-Armed Bandits)

4 questions to test your understanding

Score: 0 / 4
Question 1 Multiple Choice

In the multi-armed bandit setting, why can't you simply use the Hedge algorithm from the full-information expert problem?

AHedge requires continuous loss values, but bandits only produce binary rewards
BHedge updates weights for ALL experts each round, but in bandits you only observe the reward of the chosen arm — you lack the information to update the weights of unchosen arms
CHedge has higher computational complexity than bandit algorithms
DHedge assumes the losses are adversarial, but bandit problems assume stochastic rewards
Question 2 Multiple Choice

The UCB1 algorithm selects the arm with the highest value of (empirical mean + sqrt(2 * ln(t) / n_i)), where n_i is the number of times arm i has been pulled. What does the second term accomplish?

AIt adds random noise to encourage exploration in early rounds
BIt is a confidence bonus that is larger for arms that have been pulled fewer times, ensuring that under-explored arms are given a fair chance — implementing 'optimism in the face of uncertainty'
CIt penalizes arms that have been pulled too often to prevent over-exploitation
DIt corrects for the bias in the empirical mean estimate when sample sizes are small
Question 3 True / False

In a stochastic K-armed bandit, the minimax optimal regret is Theta(sqrt(KT)), while in the full-information setting with K experts it is Theta(sqrt(T ln K)). The extra sqrt(K/ln K) factor is the price of partial feedback.

TTrue
FFalse
Question 4 Short Answer

Explain the exploration-exploitation dilemma in bandits and why it does not arise in the full-information expert setting.

Think about your answer, then reveal below.