An agent is learning to navigate a maze using Q-learning. Partway through training, you switch to a purely greedy policy — the agent always picks the action with the highest current Q-value and never explores. What is the most likely consequence?
A. The Q-values continue to converge correctly, since the agent still receives reward signals
B. The agent reaches the optimal policy faster, since it stops wasting steps on suboptimal actions
C. The Q-values may converge to a suboptimal policy if the agent never discovers better paths it has not yet explored
D. The Q-values stop updating entirely because the temporal difference error becomes zero
Q-learning requires exploration to guarantee convergence to the optimal policy, so option C is the most likely outcome. If the agent always acts greedily on partially learned Q-values, it may become trapped, exploiting paths it already knows while never discovering better routes it has not yet visited. Acting greedily does not by itself drive the temporal difference error to zero: updates continue on the transitions the agent still visits, and the expected error is zero for every state-action pair only once the Q-values satisfy the Bellman optimality equation. Option B is the tempting misconception: exploration feels 'wasteful,' but it is essential for finding the global optimum.
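The trap can be reproduced in a few lines. The sketch below is an illustration under stated assumptions, not part of the original question: the two-state "maze", its rewards, and the `train` helper are all hypothetical, and for brevity the greedy run is greedy from the first step rather than switching partway through training.

```python
import numpy as np

# Hypothetical 2-state maze: (state, action) -> (reward, next_state).
# A next_state of None means the episode ends.
TRANSITIONS = {
    (0, 0): (1.0, None),   # short, easily found exit with a small reward
    (0, 1): (0.0, 1),      # detour that leads deeper into the maze
    (1, 0): (0.0, None),   # dead end
    (1, 1): (10.0, None),  # the better exit
}

def train(epsilon, episodes=2000, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((2, 2))
    for _ in range(episodes):
        s = 0
        while s is not None:
            if rng.random() < epsilon:
                a = int(rng.integers(2))      # explore: random action
            else:
                a = int(np.argmax(Q[s]))      # exploit: ties resolve to action 0
            r, s_next = TRANSITIONS[(s, a)]
            target = r if s_next is None else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

Q_greedy = train(epsilon=0.0)    # never explores
Q_explore = train(epsilon=0.3)   # explores 30% of the time

print("greedy-only Q-values:\n", Q_greedy)
print("epsilon-greedy Q-values:\n", Q_explore)
```

In the greedy run the agent takes the small-reward exit on its first step, that entry becomes the largest Q-value at the start state, and the detour is never tried, so its Q-value stays at its initial value of zero. With exploration enabled, the agent should eventually stumble on the larger reward and prefer the detour.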
Question 2 Multiple Choice
Q-learning updates Q(s, a) using max_a' Q(s', a') rather than Q(s', actual next action). What does this choice make Q-learning?
A. On-policy: the agent learns the value of the policy it is currently following, including exploratory steps
B. Off-policy: the agent learns the value of the optimal policy regardless of which action it actually takes next
C. Model-based: the max operator implicitly models all possible next states
D. On-policy: using the maximum ensures the learning target matches the agent's behavior policy
Using max_a' Q(s', a') evaluates the best possible action at the next state, not the action the agent actually takes. This means Q-learning learns the optimal (greedy) policy even while the agent follows an exploratory behavior policy, which is what option B describes. This off-policy property allows ε-greedy exploration: the agent takes random actions to discover new transitions, but updates always target optimal value estimates. Option A instead describes SARSA, the on-policy alternative, which learns the value of the exploration policy itself.
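The difference between the two targets is easiest to see on a single transition. In the minimal sketch below, the Q-table, reward, and discount factor are made-up numbers chosen for easy arithmetic, and the function names are illustrative rather than taken from any library.

```python
import numpy as np

def q_learning_target(Q, r, s_next, gamma):
    """Off-policy target: bootstrap from the best action at s_next."""
    return float(r + gamma * np.max(Q[s_next]))

def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action actually taken at s_next."""
    return float(r + gamma * Q[s_next, a_next])

# Made-up Q-table: rows are states, columns are actions.
Q = np.array([[0.0, 0.0],
              [2.0, 5.0]])
r, s_next, gamma = 1.0, 1, 0.5
a_next = 0  # the agent explores and picks the non-greedy action at s_next

q_target = q_learning_target(Q, r, s_next, gamma)     # 1 + 0.5 * 5 = 3.5
s_target = sarsa_target(Q, r, s_next, a_next, gamma)  # 1 + 0.5 * 2 = 2.0
print(q_target, s_target)
```

The Q-learning target ignores `a_next` entirely, which is why the exploratory choice does not affect what it learns. The two targets coincide whenever the next action happens to be the greedy one.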
Question 3 True / False
Q-learning can converge to the optimal policy even when the agent takes many random exploratory actions during training.
T. True
F. False
Answer: True
This is the defining off-policy property of Q-learning. Because the update rule uses max_a' Q(s', a'), the value of the best action rather than the action actually taken, the learning target always points toward the optimal policy regardless of the behavior policy generating the data. As long as every state-action pair keeps being visited and the learning rate is decayed appropriately, the Q-values converge to the optimal values even if the agent spent most of its training time taking random actions.
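A small experiment makes the claim concrete. The sketch below assumes a hypothetical four-state corridor with the goal at the right end and a behavior policy that is uniformly random on every step. Because this toy environment is deterministic, a constant step size is enough here. That shortcut is an assumption of the example and does not hold for stochastic environments.

```python
import numpy as np

N_STATES, GOAL = 4, 3   # hypothetical corridor: states 0..3, state 3 is the goal
LEFT, RIGHT = 0, 1

def step(s, a):
    """Deterministic move; reward 1 only on entering the goal."""
    s_next = max(s - 1, 0) if a == LEFT else s + 1
    reward = 1.0 if s_next == GOAL else 0.0
    return reward, s_next

def q_learning_random_behavior(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, 2))   # the goal row is never updated and stays 0
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = int(rng.integers(2))   # behavior policy: pure random actions
            r, s_next = step(s, a)
            # The target still uses the max over next actions, not the random action.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q

Q = q_learning_random_behavior()
greedy_policy = np.argmax(Q[:GOAL], axis=1)   # greedy action in each non-goal state
print(Q)
print(greedy_policy)
```

For this corridor the optimal values of moving right are 0.81, 0.9, and 1.0 in states 0, 1, and 2 (powers of the discount factor 0.9). The learned table should approach them, and the greedy policy should be to move right everywhere, even though the agent never acted greedily during training.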
Question 4 True / False
Q-learning requires a model of the environment's transition probabilities P(s'|s,a) to perform its updates.
T. True
F. False
Answer: False
Q-learning is model-free: it requires only the tuple (s, a, r, s') observed from direct interaction. The update Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)] uses the actual observed next state and reward — no probability model is needed. This is what distinguishes Q-learning from dynamic programming methods like value iteration, which require the full transition model P(s'|s,a) to compute expected values.
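The contrast shows up directly in the function signatures. In the sketch below, `q_update` needs only one observed transition, while `model_based_backup`, a Q-value form of the value iteration backup, needs full transition and reward arrays. The array layout `P[s, a, s']` and `R[s, a, s']` and all numbers are assumptions made for this illustration.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Model-free: one Q-learning step from a single observed (s, a, r, s')."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return float(td_error)

def model_based_backup(Q, P, R, s, a, gamma=0.9):
    """Model-based: expectation over all next states, which requires P[s, a, s']."""
    return float(np.sum(P[s, a] * (R[s, a] + gamma * np.max(Q, axis=1))))

# Two states, two actions, all values made up.
Q = np.zeros((2, 2))

# Q-learning: the agent simply observed that action 1 in state 0
# gave reward 1 and led to state 1.
td = q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.5)

# The model-based backup cannot run without the full model.
P = np.zeros((2, 2, 2))
R = np.zeros((2, 2, 2))
P[0, 1] = [0.0, 1.0]   # action 1 in state 0 always leads to state 1
R[0, 1] = [0.0, 1.0]   # and pays reward 1 on that transition
backup = model_based_backup(Q, P, R, s=0, a=1, gamma=0.5)

print(td, Q[0, 1], backup)   # 1.0 0.5 1.0
```

The sample-based update moves `Q[0, 1]` only part of the way toward the target (by the step size `alpha`), whereas the model-based backup computes the full expected value in one step because it already knows the transition probabilities.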
Question 5 Short Answer
Why does the Q-learning update use max_a' Q(s', a') rather than Q(s', actual next action), and what property of Q-learning does this create?
Model answer: Using max_a' Q(s', a') targets the value of the best possible action at the next state, regardless of what the agent actually does. This makes Q-learning off-policy: the Q-values converge toward those of the optimal (always-greedy) policy even when the agent follows an exploratory behavior policy that frequently takes non-greedy actions. The separation between the behavior policy (what the agent does) and the target policy (what updates point toward) allows the agent to explore freely without corrupting its estimate of optimal values.
If the update used Q(s', a_actual) — as SARSA does — the learned values would reflect the exploratory policy's value, not the optimal policy's value. Off-policy learning is powerful because it allows experience from imperfect or random behavior to be used for learning an optimal policy.