Questions — Q-Learning Algorithm — Open Knowledge Graph

Question 1 Multiple Choice

An agent is learning to navigate a maze using Q-learning. Partway through training, you switch to a purely greedy policy — the agent always picks the action with the highest current Q-value and never explores. What is the most likely consequence?

A. The Q-values continue to converge correctly, since the agent still receives reward signals

B. The agent reaches the optimal policy faster, since it stops wasting steps on suboptimal actions

C. The Q-values may converge to a suboptimal policy if the agent never discovers better paths it has not yet explored

D. The Q-values stop updating entirely because the temporal difference error becomes zero
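
For reference, the difference between an exploring policy and the purely greedy policy described in Question 1 can be sketched as below. This is a minimal illustration, not a prescribed implementation: the function name `select_action`, the list `q_row` (the Q-values of one state, indexed by action) and the parameter `epsilon` are assumptions introduced here. Setting `epsilon` to 0 gives the purely greedy behavior from the question.

```python
import random

def select_action(q_row, epsilon, rng=random):
    """Epsilon-greedy selection over the Q-values of a single state.

    q_row   -- hypothetical list of Q-values, one per action
    epsilon -- probability of taking a uniformly random action
    """
    if rng.random() < epsilon:
        # Exploratory step: ignore the Q-values and pick any action.
        return rng.randrange(len(q_row))
    # Greedy step: index of the highest Q-value (ties go to the lowest index).
    return max(range(len(q_row)), key=lambda a: q_row[a])
```

With `epsilon = 0` the first branch is never taken, so an action whose Q-value is currently lower than the best one is never tried again from that state.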

Question 2 Multiple Choice

Q-learning updates Q(s, a) using max_a' Q(s', a') rather than Q(s', actual next action). What does this choice make Q-learning?

A. On-policy: the agent learns the value of the policy it is currently following, including exploratory steps

B. Off-policy: the agent learns the value of the optimal policy regardless of which action it actually takes next

C. Model-based: the max operator implicitly models all possible next states

D. On-policy: using the maximum ensures the learning target matches the agent's behavior policy
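
The two update targets contrasted in Question 2 can be written out side by side. This is a sketch under stated assumptions: the function names, the discount factor `gamma` and the step size `alpha` do not appear in the question and are introduced here only for illustration, and `next_q_row` stands for the Q-values of the next state s', indexed by action.

```python
def max_target(reward, next_q_row, gamma):
    """Target built from max_a' Q(s', a'), as in the Q-learning update."""
    return reward + gamma * max(next_q_row)

def next_action_target(reward, next_q_row, next_action, gamma):
    """Target built from Q(s', actual next action)."""
    return reward + gamma * next_q_row[next_action]

def td_update(q_sa, target, alpha):
    """Move Q(s, a) a fraction alpha of the way toward the target."""
    return q_sa + alpha * (target - q_sa)

# Hypothetical numbers: reward 1.0, two actions available in s',
# and the agent actually takes action 0 next.
next_q_row = [0.0, 2.0]
print(td_update(0.0, max_target(1.0, next_q_row, gamma=0.5), alpha=0.5))
print(td_update(0.0, next_action_target(1.0, next_q_row, 0, gamma=0.5), alpha=0.5))
```

The two targets agree only when the action actually taken in s' is also the one with the highest Q-value there.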

Question 3 True / False

Q-learning can converge to the optimal policy even when the agent takes many random exploratory actions during training.

True

False

Question 4 True / False

Q-learning requires a model of the environment's transition probabilities P(s'|s,a) to perform its updates.

True

False

Question 5 Short Answer

Why does the Q-learning update use max_a' Q(s', a') rather than Q(s', actual next action), and what property of Q-learning does this create?

