Questions: Q-Learning Algorithm

5 questions to test your understanding

Question 1 Multiple Choice

An agent is learning to navigate a maze using Q-learning. Partway through training, you switch to a purely greedy policy — the agent always picks the action with the highest current Q-value and never explores. What is the most likely consequence?

A. The Q-values continue to converge correctly, since the agent still receives reward signals
B. The agent reaches the optimal policy faster, since it stops wasting steps on suboptimal actions
C. The Q-values may converge to a suboptimal policy if the agent never discovers better paths it has not yet explored
D. The Q-values stop updating entirely because the temporal difference error becomes zero
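
For reference, here is a minimal sketch of epsilon-greedy action selection, assuming a tabular setup in which `q_values` is a list of Q-values for the current state. The names `select_action`, `q_values`, `epsilon`, and `rng` are illustrative and do not come from the quiz. Setting `epsilon` to 0 gives the purely greedy behaviour described in the question.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Pick an action index from the Q-values of a single state.

    Assumption: actions are indexed 0..len(q_values)-1.
    epsilon = 0.0 is purely greedy; epsilon > 0 takes a uniformly
    random action with that probability.
    """
    if rng.random() < epsilon:
        # Explore: ignore the Q-values and pick any action.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value (the first one wins ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


greedy_choice = select_action([0.1, 0.5, 0.2], epsilon=0.0)
```

With `epsilon=0.0` the random branch is never taken, so an action whose Q-value is currently low is never tried again.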
Question 2 Multiple Choice

Q-learning updates Q(s, a) using max_a' Q(s', a') rather than Q(s', actual next action). What does this choice make Q-learning?

A. On-policy: the agent learns the value of the policy it is currently following, including exploratory steps
B. Off-policy: the agent learns the value of the optimal policy regardless of which action it actually takes next
C. Model-based: the max operator implicitly models all possible next states
D. On-policy: using the maximum ensures the learning target matches the agent's behavior policy
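
The two targets contrasted in this question can be sketched side by side. This is an illustrative sketch only, assuming a tabular `Q` stored as a list of per-state lists, a non-terminal next state, and hypothetical names (`max_target`, `actual_action_target`, `td_update`, `alpha`, `gamma`) that are not from the quiz.

```python
def max_target(Q, reward, next_state, gamma):
    """Target built from max_a' Q(s', a'), as in the Q-learning update."""
    return reward + gamma * max(Q[next_state])


def actual_action_target(Q, reward, next_state, next_action, gamma):
    """Target built from Q(s', actual next action)."""
    return reward + gamma * Q[next_state][next_action]


def td_update(Q, state, action, target, alpha):
    """Move Q(s, a) a fraction alpha of the way toward the target."""
    Q[state][action] += alpha * (target - Q[state][action])


# Two states, two actions each. In state 1, action 1 looks best.
Q = [[0.0, 0.0], [1.0, 3.0]]

# Suppose the agent's actual next action in state 1 is action 0.
t_max = max_target(Q, reward=1.0, next_state=1, gamma=0.5)
t_actual = actual_action_target(Q, reward=1.0, next_state=1, next_action=0, gamma=0.5)

td_update(Q, state=0, action=0, target=t_max, alpha=0.5)
```

Both targets use only the sampled reward and next state. They differ only in which next-state Q-value they bootstrap from.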
Question 3 True / False

Q-learning can converge to the optimal policy even when the agent takes many random exploratory actions during training.

T. True
F. False
Question 4 True / False

Q-learning requires a model of the environment's transition probabilities P(s'|s,a) to perform its updates.

T. True
F. False
Question 5 Short Answer

Why does the Q-learning update use max_a' Q(s', a') rather than Q(s', actual next action), and what property of Q-learning does this create?
