Agents balance exploring unknown actions (gathering information) with exploiting known high-reward actions. ε-greedy explores with probability ε; UCB optimism balances via upper confidence bounds; Thompson sampling uses posterior uncertainty. No universally optimal strategy exists.
From reinforcement learning, you know that an agent learns by interacting with an environment, taking actions, and receiving rewards. The agent's goal is to maximize cumulative reward over time. But here is the fundamental tension: to maximize reward, the agent should take the action it believes is best (exploitation). Yet its beliefs might be wrong — the action it has tried most might not actually be the best, and some untried action could yield much higher reward. To discover this, the agent must try unfamiliar actions (exploration), which costs immediate reward. This is the exploration-exploitation tradeoff, and it appears everywhere an agent must make sequential decisions under uncertainty.
The cleanest setting to study this tradeoff is the multi-armed bandit problem. Imagine a row of slot machines, each with an unknown payout probability. You have a fixed number of pulls. Pure exploitation means always pulling the lever with the highest observed average payout — but if you tried each machine only once, your estimates are noisy and you might be stuck on a mediocre machine. Pure exploration means pulling levers at random to gather more data — but you waste pulls on machines you already know are bad. The optimal strategy must balance both, and the right balance depends on how many pulls remain (more remaining = more value from exploration) and how uncertain you are about each machine.
The ε-greedy strategy is the simplest approach: with probability (1−ε), take the best-known action; with probability ε, take a random action. It works, but crudely — it explores all actions equally regardless of how much you already know about them, and the fixed ε does not adapt as uncertainty decreases over time (though ε can be decayed on a schedule). Upper Confidence Bound (UCB) methods are more principled: for each action, compute an optimistic estimate that equals the observed average reward plus a bonus proportional to uncertainty (which shrinks as you try that action more). The agent always picks the action with the highest optimistic estimate, so it naturally explores uncertain actions and stops exploring once it is confident. The principle is "optimism in the face of uncertainty" — assume the best until proven otherwise.
Thompson sampling takes a Bayesian approach that you can appreciate from probability basics. Instead of computing a confidence bound, maintain a full posterior distribution over each action's reward probability. To choose an action, draw one random sample from each action's posterior distribution and pick the action whose sample is highest. Actions with high uncertainty produce widely varying samples, so they get selected often — that is exploration. Actions with well-estimated high rewards produce consistently high samples — that is exploitation. As the posteriors tighten with more data, exploration naturally fades. Thompson sampling often matches or beats UCB empirically and extends naturally to more complex settings. The deep insight across all these methods is that exploration is not wasted effort — it is an investment in information that pays off through better exploitation later. The art lies in calibrating how much information is worth gathering given the time remaining and the uncertainty at hand.
No topics depend on this one yet.