An ε-greedy agent (ε=0.1) has tried each of 10 slot machines exactly 1,000 times and has well-estimated average payouts for all of them. A critic says the agent is now exploring too wastefully. What is the most accurate diagnosis of ε-greedy's problem in this situation?
A. ε-greedy continues exploring all machines equally regardless of how confident the agent is, wasting 10% of pulls on machines already known to be inferior
B. ε-greedy explores too little at this stage: after 1,000 trials per machine the agent should increase ε to refine its estimates further
C. ε-greedy is optimal here because 10% is the statistically correct exploration rate for 10 machines with 1,000 trials each
D. ε-greedy should be replaced with pure exploitation (ε=0) only after the total number of trials exceeds the square root of the number of machines times 1,000
After 1,000 trials per machine, the agent has accurate estimates and high confidence. Yet fixed ε-greedy still spends 10% of its pulls choosing a machine uniformly at random, so most of those exploratory pulls go to machines already confirmed to be inferior. UCB handles this better: its exploration bonus shrinks as a machine's pull count grows, so well-characterized machines are revisited only rarely once their estimates are stable. ε-greedy's inability to adapt exploration to uncertainty is its primary structural weakness: it treats a machine tried 5 times the same as one tried 1,000 times.
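The contrast above can be put into rough numbers. The sketch below is illustrative only and rests on two assumptions that the question does not state: the exploratory pull is uniform over all machines (including the greedy one), and the UCB bonus has the common UCB1 form sqrt(c · ln t / n) with c = 2. The function names are hypothetical.

```python
import math

def epsilon_greedy_wasted_fraction(epsilon: float, n_arms: int) -> float:
    """Expected share of pulls that land on non-greedy arms.

    Assumes the exploratory pull is uniform over ALL arms, so a
    fraction 1/n_arms of exploratory pulls still hits the greedy arm.
    """
    return epsilon * (n_arms - 1) / n_arms

def ucb1_bonus(total_pulls: int, arm_pulls: int, c: float = 2.0) -> float:
    """UCB1-style exploration bonus (assumed form): sqrt(c * ln(t) / n)."""
    return math.sqrt(c * math.log(total_pulls) / arm_pulls)

# Scenario from the question: 10 machines, 1,000 pulls each, epsilon = 0.1.
wasted = epsilon_greedy_wasted_fraction(0.1, 10)

# Bonus for a machine tried 5 times versus one tried 1,000 times,
# both evaluated at the same total of 10,000 pulls.
bonus_5 = ucb1_bonus(10_000, 5)
bonus_1000 = ucb1_bonus(10_000, 1000)

print(f"epsilon-greedy pulls on non-greedy machines: {wasted:.1%}")
print(f"UCB bonus after 5 pulls:     {bonus_5:.3f}")
print(f"UCB bonus after 1,000 pulls: {bonus_1000:.3f}")
```

Under these assumptions the ε-greedy rate stays fixed however much data has been collected, while the UCB bonus for the 1,000-pull machine is about 14 times smaller than for the 5-pull machine.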
Question 2 Multiple Choice
A Thompson sampling agent for a 5-arm bandit problem has tried arm 3 only twice and has a very wide posterior distribution for its reward probability. What mechanically causes the agent to explore arm 3 frequently despite no explicit exploration rule?
A. The wide posterior produces high-variance samples, so arm 3 frequently generates the highest sampled value among all arms and gets selected
B. Thompson sampling adds an explicit exploration bonus to arms with few trials, similar to UCB's confidence interval
C. Thompson sampling always selects the arm with the lowest observed average reward to gather maximally diverse data
D. The posterior distribution for arm 3 has a higher mean than arms tried more often, making it preferentially selected
Thompson sampling selects an arm by drawing one sample from each arm's posterior and picking the highest. An arm with few trials has a wide, uncertain posterior — its samples span a large range. Even if its observed average is mediocre, it will occasionally produce very high samples (because the distribution is wide), causing it to win the comparison and be selected. As more data is collected, the posterior narrows, samples become less variable, and the arm is explored less frequently unless it genuinely has high reward. Exploration emerges naturally from Bayesian uncertainty — no explicit bonus rule is needed.
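A small simulation can make this mechanism concrete. It is a sketch under assumptions not given in the question: rewards are Bernoulli, each arm has a Beta posterior starting from a uniform Beta(1, 1) prior, and the specific success and failure counts are invented for illustration. `selection_share` is a hypothetical helper name.

```python
import random

def selection_share(posteriors, arm, n_rounds=10_000, seed=0):
    """Fraction of rounds in which `arm` has the highest posterior sample.

    `posteriors` is a list of (alpha, beta) parameters of Beta posteriors.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_rounds):
        # One sample per arm, then pick the arm with the largest sample.
        samples = [rng.betavariate(a, b) for a, b in posteriors]
        best = max(range(len(samples)), key=samples.__getitem__)
        if best == arm:
            wins += 1
    return wins / n_rounds

# Four well-known arms: 600 successes and 400 failures each -> Beta(601, 401).
well_known = (601, 401)

# Arm 3 (index 2) after only 2 trials: 1 success, 1 failure -> Beta(2, 2).
wide = [well_known, well_known, (2, 2), well_known, well_known]

# Same arm after 1,000 trials at a 50% success rate -> Beta(501, 501).
narrow = [well_known, well_known, (501, 501), well_known, well_known]

share_wide = selection_share(wide, arm=2)
share_narrow = selection_share(narrow, arm=2)
print(f"arm 3 selected with wide posterior:   {share_wide:.1%}")
print(f"arm 3 selected with narrow posterior: {share_narrow:.1%}")
```

In both cases arm 3 has a posterior mean of 0.5, below the other arms' roughly 0.6. Only the width of its posterior changes, and that alone decides how often it wins the comparison.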
Question 3 True / False
An agent that always exploits the action with the highest observed average reward — with no exploration — can perform suboptimally even if its current best estimate happens to be accurate.
T. True
F. False
Answer: True
Even correct estimates can become inaccurate over time in non-stationary settings where reward distributions shift. More fundamentally, a pure exploitation agent cannot detect when its beliefs are wrong. If it happens to overestimate a mediocre arm early on, or to underestimate a good one, it can commit to the wrong arm indefinitely. The value of occasional exploration is insurance: it provides a mechanism for discovering that current beliefs are incorrect, at the cost of some immediate reward. In stationary settings with accurate estimates, the cost of never exploring is low; in real-world settings with uncertainty, pure exploitation is systematically fragile.
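The lock-in described above can be reproduced in a few lines. The setup is a hypothetical one: two Bernoulli arms with true success rates 0.4 and 0.8, where the agent's single early pull of each happened to give a success on the mediocre arm and a failure on the better one. `run_greedy` is an illustrative name, not a library function.

```python
import random

def run_greedy(true_means, counts, estimates, n_steps, seed=0):
    """Pure exploitation: always pull the arm with the highest estimate."""
    rng = random.Random(seed)
    counts = list(counts)
    estimates = list(estimates)
    for _ in range(n_steps):
        arm = max(range(len(estimates)), key=estimates.__getitem__)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running average for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

# Arm 0 is mediocre (0.4) but looked perfect after one lucky pull;
# arm 1 is better (0.8) but looked worthless after one unlucky pull.
counts, estimates = run_greedy(
    true_means=[0.4, 0.8],
    counts=[1, 1],
    estimates=[1.0, 0.0],
    n_steps=1000,
)
print("pull counts:", counts)
print("estimates:  ", [round(e, 3) for e in estimates])
```

Because arm 0's running average can never fall to zero after its initial success, the greedy agent never pulls arm 1 again and its estimate of the better arm is never corrected, whatever the random seed.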
Question 4 True / False
UCB (Upper Confidence Bound) methods explore by randomly selecting a non-greedy action with a fixed probability, similar to ε-greedy but with a smaller and more carefully tuned exploration rate.
T. True
F. False
Answer: False
UCB does not use random exploration. It deterministically selects the action with the highest 'optimistic estimate': the observed mean reward plus an uncertainty bonus that is large for rarely-tried actions and small for well-characterized ones. The agent always picks the action with the best upper confidence bound, so exploration is targeted: it concentrates on actions where uncertainty remains high. By contrast, ε-greedy's exploratory pulls choose among actions uniformly at random, regardless of their uncertainty. UCB's 'optimism in the face of uncertainty' principle focuses exploration where it is most likely to change decisions, which is fundamentally different from random exploration.
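A minimal sketch of this selection rule, assuming the common UCB1 bonus sqrt(c · ln t / n) with c = 2 (the explanation above does not commit to a particular formula). The means and pull counts are made up for illustration, and `ucb1_select` is a hypothetical name.

```python
import math

def ucb1_select(means, counts, c=2.0):
    """Return (chosen_arm, scores) under a UCB1-style rule.

    Each score is the observed mean plus sqrt(c * ln(total) / n).
    Assumes every arm has been pulled at least once.
    """
    total = sum(counts)
    scores = [
        mean + math.sqrt(c * math.log(total) / n)
        for mean, n in zip(means, counts)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Two well-characterized arms and one arm tried only 5 times.
means = [0.60, 0.55, 0.50]
counts = [1000, 1000, 5]

arm, scores = ucb1_select(means, counts)
print("chosen arm:", arm)
print("scores:", [round(s, 3) for s in scores])
```

No random number generator appears anywhere: the same means and counts always produce the same choice. Here the rarely-tried arm is selected despite having the lowest observed mean, because its bonus dominates.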
Question 5 Short Answer
Why is exploration not simply 'wasted effort'? Explain what exploration actually achieves and how its value depends on the situation.
Model answer: Exploration is an investment in information. By trying uncertain actions, the agent acquires more accurate estimates of their true reward distributions. Better estimates enable better exploitation decisions in the future — the information gathered through exploration pays dividends through improved choices over all remaining time steps. The value of exploration therefore depends on how many decisions remain: with many steps left, the improved exploitation from better information easily outweighs the immediate cost of suboptimal actions; with few steps remaining, there is little time to recoup the investment.
The key insight is that exploration and exploitation operate on different timescales. Exploitation maximizes *immediate* reward; exploration maximizes *future* reward by reducing uncertainty. This temporal structure explains why principled strategies (like UCB and Thompson sampling) naturally reduce exploration as more data is gathered: not because exploration becomes bad, but because uncertainty decreases and the marginal value of additional information shrinks. A common misconception is treating exploration as a regrettable cost; it is better understood as portfolio diversification for an uncertain environment.
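The dependence on the remaining horizon can be illustrated with a deliberately simple toy calculation. Every number and the model itself are assumptions made for illustration, not part of the model answer: exploring takes `k` pulls, each costing `cost` in expected reward, and with probability `p_better` it reveals an arm that pays `gain` more per pull for the rest of the horizon.

```python
def net_gain_from_exploring(horizon, k=20, p_better=0.3, gain=0.2, cost=0.08):
    """Expected net reward of exploring for k pulls, relative to never exploring.

    Toy model (all parameters hypothetical):
      - each exploratory pull loses `cost` in expected reward,
      - with probability `p_better` the explored arm is better by `gain`
        per pull, and is then exploited for the remaining horizon - k pulls.
    """
    expected_benefit = p_better * gain * (horizon - k)
    expected_cost = cost * k
    return expected_benefit - expected_cost

for horizon in (30, 100, 1000):
    print(f"horizon {horizon:>4}: net gain {net_gain_from_exploring(horizon):+.1f}")
```

With these made-up numbers, exploring loses reward when only 30 decisions remain and pays off clearly at 1,000, which is the investment logic described above.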