A student implements DQN but omits experience replay, training directly on consecutive game transitions in order. They observe the agent rapidly learns one section of the game but then 'forgets' earlier patterns. What does this illustrate?
A. The neural network has insufficient capacity to memorize the entire game's state space
B. Consecutive game frames are highly correlated, causing the network to overfit to recent experience and catastrophically forget lessons from earlier states
C. The target network is updating too slowly to keep pace with the rapidly changing policy
D. Convolutional layers cannot generalize across different game screen positions without replay
Answer: B
Stochastic gradient descent works best when training samples are approximately i.i.d. Consecutive game frames violate this: they are highly temporally correlated (each frame strongly resembles the previous one). Without replay, the network sees a stream of similar recent experiences and fits its weights to them, overwriting what it learned from earlier, different experiences. Experience replay breaks this correlation by sampling random minibatches from a large buffer spanning diverse past experiences.
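The buffer-and-sample idea can be sketched in a few lines of Python. This is a minimal illustration rather than a reference DQN implementation: the `ReplayBuffer` class, its method names, and the capacity and batch size are hypothetical choices made for the example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are evicted first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes old and new experience,
        # so a minibatch is no longer a run of near-identical consecutive frames.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Toy usage: "states" are just step indices standing in for game frames.
buffer = ReplayBuffer(capacity=50)
for t in range(100):
    buffer.push(state=t, action=0, reward=0.0, next_state=t + 1, done=False)

batch = buffer.sample(8)
```

Training on `batch` instead of the latest transition is what decorrelates the updates. The buffer size trades memory for diversity of past experience.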
Question 2 Multiple Choice
What problem does the DQN target network solve, and how?
A. It provides extra training data by generating synthetic rollouts when real experience is sparse
B. It prevents Q-values from diverging to infinity by clamping the maximum target value to a fixed scale
C. It stabilizes learning by providing temporarily stationary targets: a frozen copy of the network computes training targets, updated only periodically so the Q-network learns toward a stable objective
D. It ensures exploration by generating random actions until the main network's Q-values converge
Answer: C
In standard Q-learning with a neural network, the same weights are used both to predict Q(s, a) and to compute the target values (r + γ max Q(s', a'), with the max taken over next actions a'). Every weight update changes the targets immediately, creating a moving-target problem, like trying to hit a target that shifts every time you shoot. The target network solves this by holding a frozen copy whose weights are synced from the main network only periodically (on the order of thousands of steps), making the targets temporarily stationary and giving the main network a stable objective to minimize.
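A small sketch can make the "temporarily stationary" part concrete. It assumes a tiny linear Q-function as a stand-in for a deep network. The names `LinearQ`, `td_target`, `update`, and the value of `SYNC_EVERY` are hypothetical and chosen only for illustration.

```python
import copy


class LinearQ:
    """Tiny linear Q-function: Q(s, a) = w[a] . s (stand-in for a deep network)."""

    def __init__(self, n_actions, n_features):
        self.w = [[0.0] * n_features for _ in range(n_actions)]

    def q_values(self, state):
        return [sum(wi * si for wi, si in zip(row, state)) for row in self.w]


def td_target(target_net, reward, next_state, done, gamma=0.99):
    """r + gamma * max_a' Q_target(s', a'), with no bootstrap on terminal states."""
    if done:
        return reward
    return reward + gamma * max(target_net.q_values(next_state))


def update(online, target_net, transition, lr=0.1, gamma=0.99):
    """One semi-gradient step on the squared TD error for a single transition."""
    s, a, r, s2, done = transition
    y = td_target(target_net, r, s2, done, gamma)
    error = y - online.q_values(s)[a]
    online.w[a] = [wi + lr * error * si for wi, si in zip(online.w[a], s)]
    return y


online = LinearQ(n_actions=2, n_features=2)
target = copy.deepcopy(online)   # frozen copy used only for computing targets
SYNC_EVERY = 3                   # hypothetical; real setups use thousands of steps

transition = ([1.0, 0.0], 0, 1.0, [1.0, 0.0], False)
targets = []
for step in range(1, 7):
    targets.append(update(online, target, transition))
    if step % SYNC_EVERY == 0:
        target = copy.deepcopy(online)  # periodic hard sync
```

In this run the target stays fixed for the first three updates even though `online` changes on every step. It moves only once, at the sync, and then stays fixed again.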
Question 3 True / False
DQN can learn directly from raw pixel inputs because convolutional layers extract spatial features that the fully connected output layers map to per-action Q-values.
T. True
F. False
Answer: True
This is the architectural contribution of DQN: stacked convolutional layers process the raw game screen (a 2D image) and extract spatially meaningful features (edges, objects, sprites) without hand-crafted feature engineering. The subsequent fully connected layers then map these visual features to a Q-value for each available action, so a single forward pass estimates the value of every action simultaneously.
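To see how a screen shrinks into a feature vector, the shape arithmetic can be worked through in Python. The layer sizes below (four stacked 84×84 frames and three unpadded convolutions) are the configuration commonly reported for the Atari DQN. They are not stated above, so treat them as assumptions. `n_actions` is a hypothetical value.

```python
def conv_out(size, kernel, stride):
    """Side length after a square convolution with no padding."""
    return (size - kernel) // stride + 1


# (out_channels, kernel, stride): assumed values, not taken from the text above.
CONV_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

side, channels = 84, 4  # assumed input: 4 stacked 84x84 grayscale frames
shapes = []
for out_channels, kernel, stride in CONV_LAYERS:
    side = conv_out(side, kernel, stride)
    channels = out_channels
    shapes.append((channels, side, side))

flat_features = channels * side * side  # input width of the first dense layer
n_actions = 6  # hypothetical; the final dense layer has one output per action
```

Under these assumptions the feature maps shrink from 20×20 to 9×9 to 7×7, and the dense layers receive 64 × 7 × 7 = 3136 features before producing `n_actions` Q-values.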
Question 4 True / False
Without experience replay, DQN would still converge because the Q-learning update rule is mathematically designed to handle correlated sequential observations.
T. True
F. False
Answer: False
The Q-learning update rule (from tabular RL) is guaranteed to converge only in the tabular setting, and only under certain conditions: every state-action pair must be visited infinitely often and the learning rate must decay appropriately. Those guarantees do not carry over to a neural network. Because the network shares weights across states, temporally correlated training samples cause gradient descent to overfit to recent experience at the cost of earlier knowledge, a phenomenon that destabilized early neural network Q-learning attempts. Experience replay is specifically designed to mitigate this by decorrelating the training distribution, and it was empirically essential for DQN's stability.
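For reference, the tabular update and the standard textbook conditions under which it is known to converge can be written as follows. The step-size symbol α_t is introduced here for illustration and does not appear in the text above.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha_t \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

\sum_t \alpha_t = \infty, \qquad
\sum_t \alpha_t^2 < \infty, \qquad
\text{every pair } (s, a) \text{ is visited infinitely often}
```

Each tabular update changes a single entry Q(s, a) and leaves every other entry untouched. A neural network has no such isolation: one gradient step moves the estimates for many states at once.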
Question 5 Short Answer
Why was combining neural networks with Q-learning notoriously unstable before DQN, and which two innovations made it tractable?
Model answer: Two interacting instabilities plagued early attempts. First, consecutive game transitions are temporally correlated — the network sees highly similar states in sequence and overfits to recent experience, effectively forgetting earlier lessons. Second, the same network computes both the predicted Q-values and the training targets, so each weight update immediately changes the targets, creating a moving-target problem that can lead to oscillation or divergence. DQN addressed these with experience replay (storing transitions in a large replay buffer and sampling random minibatches to break temporal correlation) and a target network (a separate frozen copy of the Q-network used only for computing targets, periodically synced from the main network to provide a stable learning signal).
The two innovations are complementary: each targets a different instability, and the insight was that both need to be addressed together. With either one removed, training on high-dimensional inputs like Atari game screens tends to be markedly less stable.
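How the two pieces fit into one training loop can be sketched on a toy problem. The example below is a deliberately simplified stand-in: it uses a five-state chain and a Q-table instead of Atari frames and a deep network. The environment, the constants, and all names are hypothetical. Only the loop structure (store, sample, compute targets from a frozen copy, sync periodically) mirrors the description above.

```python
import random

N_STATES, GOAL = 5, 4          # hypothetical chain: states 0..4, reward at state 4
GAMMA, LR = 0.9, 0.5
BATCH, SYNC_EVERY, CAPACITY = 16, 50, 500


def step(state, action):
    """Action 1 moves right, action 0 moves left (bounded at state 0)."""
    nxt = state + 1 if action == 1 else max(0, state - 1)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


random.seed(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]      # online "network" (a table here)
q_target = [row[:] for row in q]               # frozen copy used for targets
buffer = []                                    # replay buffer, oldest dropped first

state = 0
for t in range(1, 3001):
    action = random.randrange(2)               # purely random behaviour policy
    nxt, reward, done = step(state, action)
    buffer.append((state, action, reward, nxt, done))
    if len(buffer) > CAPACITY:
        buffer.pop(0)
    state = 0 if done else nxt

    # Learn from a random minibatch rather than the latest transition.
    for s, a, r, s2, d in random.sample(buffer, min(BATCH, len(buffer))):
        y = r if d else r + GAMMA * max(q_target[s2])   # target from frozen copy
        q[s][a] += LR * (y - q[s][a])

    if t % SYNC_EVERY == 0:
        q_target = [row[:] for row in q]       # periodic hard sync

# Greedy action in each non-terminal state after training.
greedy = [max((0, 1), key=lambda a: q[s][a]) for s in range(GOAL)]
```

A table cannot forget the way a network can, so this sketch shows the mechanics of the loop and not the instability itself. Swapping the table for a function approximator is where the two components start to matter.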