A robot learning to navigate a maze always chooses the action with the highest known reward (purely greedy strategy). It finds a path yielding +5 reward and consistently follows it. The true optimal path yields +20 but was never explored. This scenario best illustrates:
A. A successful application of reinforcement learning: the robot found a working policy.
B. The exploration-exploitation tradeoff: excessive exploitation causes the agent to get stuck in a locally optimal but globally suboptimal policy.
C. A failure of the discount factor: the agent valued immediate rewards too highly.
D. A model-based failure: the agent needs to learn the transition model first.
Answer: B
This is the exploration-exploitation tradeoff in action. A purely greedy agent never explores, so it can permanently miss better options it has not yet encountered. The robot's policy is locally optimal (best among the actions it has tried) but globally suboptimal (a better path exists but was never tried). Strategies such as ε-greedy, which acts greedily most of the time but picks a random action with probability ε, and upper confidence bound (UCB) methods, which favor actions whose value is still uncertain, address this by ensuring that under-sampled actions keep being tried. The core tension: you can't exploit what you haven't explored, but you can't explore indefinitely either.
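The robot scenario can be reproduced in a few lines. The following is a minimal sketch, not a full maze: it assumes the maze reduces to a choice between two paths with fixed payoffs (+5 and +20), and the names `run_agent`, `path_reward`, and the default of 1000 steps are illustrative choices rather than anything from the question.

```python
import random

def run_agent(epsilon, steps=1000, seed=0):
    """Epsilon-greedy agent choosing between two paths with fixed rewards."""
    rng = random.Random(seed)
    path_reward = [5.0, 20.0]   # assumed payoffs: path 0 -> +5, path 1 -> +20
    q = [0.0, 0.0]              # estimated value of each path
    counts = [0, 0]             # how often each path was taken
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(2)                   # explore: random path
        else:
            action = max(range(2), key=lambda a: q[a])  # exploit; ties go to path 0
        reward = path_reward[action]
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]  # running mean
        total += reward
    return counts, total

if __name__ == "__main__":
    print("greedy    :", run_agent(epsilon=0.0))  # never leaves path 0
    print("eps = 0.1 :", run_agent(epsilon=0.1))
```

With `epsilon=0.0` the agent tries path 0 first, sees +5, and never has a reason to try path 1, which mirrors the robot in the question. Any positive `epsilon` eventually samples path 1, after which the greedy choice switches to it.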
Question 2 Multiple Choice
How does reinforcement learning differ most fundamentally from supervised learning?
A. RL requires neural networks, while supervised learning can use simpler models.
B. In RL, the agent learns from interaction, receiving reward signals without labeled 'correct answer' examples, while supervised learning trains on labeled input-output pairs provided by a human teacher.
C. RL only applies to sequential decision tasks in games, while supervised learning handles real-world problems.
D. RL always requires more data than supervised learning to achieve good performance.
Answer: B
The fundamental distinction is the source of the learning signal. Supervised learning uses labeled examples: the algorithm is told the correct output for each input. RL uses reward signals: the agent receives feedback on the consequences of its actions but is never directly told what the right action was. It must infer which actions led to good outcomes from delayed, often sparse rewards. This is part of why RL agents can reach superhuman play in some games: they need only the game's score signal, not human-labeled 'correct moves.'
Question 3 True / False
In reinforcement learning, a discount factor γ close to 1 causes the agent to value distant future rewards nearly as much as immediate ones, making it more far-sighted in its decision-making.
T. True
F. False
Answer: True
The cumulative discounted return is G = Σₜ γᵗ rₜ. When γ = 1, all future rewards count equally: a reward 100 steps away is worth as much as one received now. When γ = 0, only the immediate reward matters. Intermediate values of γ create exponential discounting: a reward t steps away is worth γᵗ of its face value. Far-sighted behavior (γ → 1) is appropriate when long-term planning matters. A smaller γ is appropriate when distant outcomes are too uncertain to plan for, and any γ < 1 also keeps the return finite in tasks that never terminate (assuming bounded rewards).
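The effect of γ is easy to check numerically. The sketch below assumes rewards are given as a plain list indexed by time step; `discounted_return` and the example reward sequence are illustrative names and values, not part of the question.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], accumulated from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

if __name__ == "__main__":
    delayed = [0.0] * 100 + [1.0]   # a single +1 reward arriving 100 steps from now
    for gamma in (1.0, 0.99, 0.9, 0.0):
        print(gamma, discounted_return(delayed, gamma))
```

For the delayed reward, γ = 1 returns the full 1.0, γ = 0 returns 0.0, and intermediate values return γ¹⁰⁰, which shrinks quickly as γ moves away from 1.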
Question 4 True / False
Model-free reinforcement learning methods are generally superior to model-based methods because they avoid making assumptions about the environment's transition dynamics.
T. True
F. False
Answer: False
The model-free vs. model-based tradeoff is not about superiority; each excels in different conditions. Model-based methods are typically far more sample-efficient: by learning a model of the environment, the agent can simulate experience and plan without repeatedly interacting with the real environment. Model-free methods (like Q-learning) are more robust to modeling error because they don't depend on the accuracy of a learned model, and an inaccurate model can lead to catastrophically wrong plans. In low-data regimes, model-based methods tend to win; in complex environments where accurate models are hard to learn, model-free methods are often preferred.
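The sample-efficiency point can be illustrated with a small sketch. It assumes a hypothetical five-state corridor with a +1 reward at the far end, and it uses a simplified Dyna-style planner that replays remembered transitions in fixed sweeps (standard Dyna-Q samples them at random instead). All names and constants here (`step`, `q_update`, `one_episode`, `GAMMA`, `ALPHA`) are assumptions made for illustration.

```python
GAMMA, ALPHA = 0.9, 1.0      # illustrative values; ALPHA = 1 suits a deterministic world
N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)           # move left, move right

def step(state, action):
    """Hypothetical deterministic corridor: +1 reward on reaching the goal."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)

def q_update(q, s, a, r, s2):
    """One Q-learning backup from a single (s, a, r, s2) transition."""
    best_next = 0.0 if s2 == GOAL else max(q[(s2, b)] for b in ACTIONS)
    q[(s, a)] += ALPHA * (r + GAMMA * best_next - q[(s, a)])

def one_episode(planning_sweeps):
    """Walk right once through the corridor, then optionally plan with the model."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}                       # learned model: (s, a) -> (r, s2)
    s = 0
    while s != GOAL:
        a = +1                       # fixed behaviour, to keep the sketch deterministic
        s2, r = step(s, a)
        q_update(q, s, a, r, s2)     # model-free: learn from the real transition
        model[(s, a)] = (r, s2)      # model-based: remember what happened
        s = s2
    for _ in range(planning_sweeps): # replay simulated experience, no new real steps
        for (ms, ma), (mr, ms2) in model.items():
            q_update(q, ms, ma, mr, ms2)
    return q

if __name__ == "__main__":
    print(one_episode(0)[(0, +1)])   # 0.0: the reward has not propagated to the start
    print(one_episode(3)[(0, +1)])   # about 0.9 ** 3 after three planning sweeps
```

Both runs use the same four real steps. Without planning, only the final transition has a non-zero value; with three sweeps over the learned model, the value reaches the start state. The model here is exact, which is the favorable case: if the remembered transitions were wrong, the same sweeps would spread the error just as efficiently.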
Question 5 Short Answer
Why is the exploration-exploitation tradeoff a fundamental challenge in reinforcement learning, and what makes it difficult to resolve optimally?
Model answer: An RL agent must balance two competing demands: exploiting actions it already knows are good (to maximize reward now) versus exploring unfamiliar actions (to discover potentially better options). The difficulty is that the agent cannot know in advance whether exploring will pay off: it might find a much better policy or waste time on terrible actions. Too little exploration leads to suboptimal policies (better options are missed); too much wastes interactions on bad actions. Computing an exactly optimal balance is intractable in general (the problem is studied in its simplest form as the multi-armed bandit), which is why practical strategies like ε-greedy, UCB, and Thompson sampling are used rather than exact optimal solutions.
The tradeoff is fundamental because of incomplete information: the agent only knows the value of actions it has tried. Unlike supervised learning (where training data is given), the RL agent must actively generate its own information through interaction. Every action simultaneously pursues reward and generates data — making exploration and exploitation inseparably entangled.
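As one concrete instance of the strategies named in the model answer, here is a minimal sketch of UCB1 on a two-armed bandit. It assumes fixed (noise-free) rewards rescaled to [0, 1], the range the standard UCB1 bonus is designed for; the arm values 0.25 and 1.0 are the +5 and +20 paths from Question 1 divided by 20, and the function name `ucb1` is illustrative.

```python
import math

def ucb1(arm_rewards, steps=1000):
    """UCB1 on arms with fixed rewards in [0, 1]; returns pull counts per arm."""
    n_arms = len(arm_rewards)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, steps + 1):
        if t <= n_arms:
            arm = t - 1   # try every arm once so each has an estimate
        else:
            # estimated value plus an optimism bonus that shrinks with more pulls
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        counts[arm] += 1
        means[arm] += (arm_rewards[arm] - means[arm]) / counts[arm]  # running mean
    return counts

if __name__ == "__main__":
    print(ucb1([0.25, 1.0]))   # most pulls go to the better arm
```

Unlike ε-greedy, the exploration here is directed: the weaker arm is revisited only when its uncertainty bonus grows large enough to compete, so the number of pulls spent on it grows slowly relative to the total.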