Why does the vanilla RNN fail to learn long-range dependencies, and how does the LSTM cell state address this?
A. Vanilla RNNs have too few parameters to capture distant patterns; LSTMs add more weight matrices that explicitly attend to earlier time steps
B. In vanilla RNNs, gradients are multiplied by the same weight matrix at every step, causing exponential decay or explosion; the LSTM cell state uses additive updates controlled by gates, allowing gradients to flow back without repeated squashing through nonlinearities
C. Vanilla RNNs process each time step independently and discard prior context; LSTMs concatenate all past hidden states into a growing memory buffer
D. Vanilla RNNs cannot handle sequences longer than the training window; LSTMs use attention to directly access any past time step regardless of distance
Answer: B
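The vanishing gradient problem is a gradient flow problem, not a parameter count or attention problem. In backpropagation through time, the gradient of the loss with respect to an early hidden state is a product of one Jacobian per time step, and each Jacobian contains the same recurrent weight matrix scaled by the derivative of the nonlinearity. If these factors have norm below 1 (for tanh, this holds whenever the largest singular value of the weight matrix is below 1), the gradient shrinks exponentially toward zero; if they have norm above 1, it can explode. The LSTM cell state avoids this by updating additively (c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t) instead of passing the state through a weight matrix and nonlinearity at every step. Along the direct cell state path the gradient is scaled only by the forget gate, so when the forget gate is near 1 it flows back through the cell state highway almost unchanged, preserving the error signal across many time steps.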
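The contrast in gradient flow can be sketched numerically. The example below is a minimal illustration, not a real training setup: it assumes a scalar (one-unit) vanilla RNN, hand-picked values for the weight, inputs, and forget gate, and it treats the forget gate as a constant so that only the direct cell state path is measured. The function names are hypothetical.

```python
import math

def vanilla_rnn_gradient(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: each step contributes w * tanh'(pre-activation) = w * (1 - h^2).
        grad *= w * (1.0 - h * h)
    return grad

def cell_state_gradient(forget_gates):
    """Return d c_T / d c_0 along the direct cell state path.

    Assumption: gates are treated as constants, so indirect paths
    through the hidden state are ignored.
    """
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad

steps = 100
rnn_grad = vanilla_rnn_gradient(w=0.9, inputs=[0.5] * steps)   # assumed weight and inputs
lstm_grad = cell_state_gradient([0.99] * steps)                # assumed forget gate value

print(f"vanilla RNN gradient after {steps} steps: {rnn_grad:.3e}")
print(f"cell state gradient after {steps} steps:  {lstm_grad:.3e}")
```

Every factor in the vanilla RNN product is at most 0.9 in magnitude here, so the result is bounded by 0.9^100, while the cell state path only loses what the forget gate removes.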
Question 2 Multiple Choice
A language model must remember whether a sentence began with a question word ('Who', 'What', 'Why') in order to correctly generate a response token 200 steps later. Which architecture handles this most reliably?
A. A vanilla RNN, because it passes a hidden state forward at every time step and accumulates context continuously
B. An LSTM, because its forget gate can learn to maintain the relevant information in the cell state across 200 steps by outputting values near 1 for that memory dimension
C. A GRU, because fewer parameters reduce overfitting on the rare question-word event and improve generalization
D. A feed-forward network with a fixed context window of 200 tokens, since explicit indexing avoids gradient decay entirely
Answer: B
The LSTM was designed for exactly this scenario. The forget gate can learn to output values near 1 for the cell state dimensions that store the question-word flag, so those dimensions stay close to their current values across the 200 steps. A vanilla RNN would typically lose the training signal to gradient decay well before 200 steps. The GRU can also handle long-range dependencies and is often competitive, but for tasks requiring very precise, long-lived memory, the LSTM's separate cell state can give it a structural advantage over the GRU's merged state.
Question 3 True / False
The forget gate in an LSTM can, in principle, preserve a piece of information indefinitely across an unlimited number of time steps by learning to output values close to 1 for the corresponding cell state dimension.
T. True
F. False
Answer: True
This follows directly from the cell state update rule. If the forget gate output for some dimension is exactly 1 at every time step, the update rule c_t = 1 * c_{t-1} + ... carries that dimension forward intact, and it stays unchanged whenever the input gate writes nothing to it, so information persists without decay. In practice, the gate is a sigmoid output that approaches 1 but never reaches it, and gates are learned, so some drift occurs; even so, the mechanism allows much longer retention than a vanilla RNN. This is why LSTMs largely mitigate the vanishing gradient problem in practice: the gradient along the cell state path is scaled only by the forget gate, not by a weight matrix and nonlinearity that repeatedly shrink it.
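How close to 1 the forget gate needs to be can be checked with a few lines of arithmetic. The sketch below assumes the forget gate is a sigmoid of a fixed pre-activation and that nothing new is written to the dimension (input gate at 0); the pre-activation values and the helper name `retained_fraction` are illustrative choices, not values from a trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def retained_fraction(forget_preactivation, steps):
    """Fraction of c_0 left after `steps` updates c_t = f * c_{t-1}.

    Assumptions: the forget gate f is constant over time and the
    input gate is 0, so no new content is added.
    """
    f = sigmoid(forget_preactivation)
    return f ** steps

for z in (2.0, 5.0, 10.0):   # assumed forget gate pre-activations
    print(f"pre-activation {z:4.1f}  f = {sigmoid(z):.6f}  "
          f"retained after 200 steps = {retained_fraction(z, 200):.3e}")
```

A gate that looks "near 1" at a single step can still erase most of the stored value over hundreds of steps, which is the drift the explanation refers to; retention improves sharply as the gate saturates.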
Question 4 True / False
GRUs consistently outperform LSTMs on tasks requiring very long-range memory because their simpler two-gate architecture provides more efficient gradient flow.
T. True
F. False
Answer: False
The evidence does not support a blanket superiority claim for GRUs on long-range tasks. LSTMs tend to have a slight edge on tasks requiring precise, long-lived memory — such as counting nested brackets or copying specific tokens from far earlier in a sequence — because the separate cell state gives an additional degree of freedom for storing information without interference from the hidden state computation. GRUs are often competitive or faster on many practical tasks, but this reflects training efficiency and dataset characteristics, not a structural advantage in long-range retention.
Question 5 Short Answer
What is the fundamental architectural insight that allows LSTMs to maintain long-range dependencies, and why does the vanilla RNN fail at this?
Model answer: The vanilla RNN repeatedly multiplies the hidden state by the same weight matrix, causing gradients to decay or explode exponentially over long sequences. The LSTM introduces a separate cell state that is updated additively rather than multiplicatively, and learned gates determine how much old information to keep and how much new information to write. Because the cell state update is additive and gated, gradients can flow back through time without being repeatedly squashed — information can persist across hundreds of steps.
The key phrase is 'additive updates with gating' vs. 'multiplicative recurrence.' In a vanilla RNN, h_t = tanh(W·h_{t-1} + ...) — the same matrix W and same nonlinearity at every step. In an LSTM, the cell state update is c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t — a blend of the old state and a new candidate, where f_t (forget gate) can be near 1 to preserve the old state. This creates a 'gradient highway' that avoids the repeated squashing problem.
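The two update rules can be written side by side as single-step functions. This is a minimal scalar sketch under stated assumptions: one hidden unit, a hypothetical `params` dictionary mapping each gate name to `(w_h, w_x, b)`, the standard output gate and `h_t = o_t * tanh(c_t)` read-out (which the text above does not spell out), and hand-picked weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn_step(h_prev, x, w_h, w_x, b):
    """Vanilla RNN: the whole state passes through W and tanh every step."""
    return math.tanh(w_h * h_prev + w_x * x + b)

def lstm_step(c_prev, h_prev, x, params):
    """Scalar LSTM step; params[name] = (w_h, w_x, b) is a hypothetical layout."""
    def pre(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x + b
    f = sigmoid(pre("f"))        # forget gate
    i = sigmoid(pre("i"))        # input gate
    o = sigmoid(pre("o"))        # output gate
    g = math.tanh(pre("g"))      # candidate values
    c = f * c_prev + i * g       # additive, gated cell state update
    h = o * math.tanh(c)         # hidden state read out from the cell
    return c, h

# Hand-picked weights: forget gate biased open, input gate biased shut.
params = {"f": (0.0, 0.0, 10.0), "i": (0.0, 0.0, -10.0),
          "o": (0.0, 0.0, 0.0), "g": (0.0, 0.0, 0.0)}

c, h_lstm = 1.0, 0.0   # pretend a flag was already written into the cell state
h_rnn = 1.0            # same starting value in the vanilla RNN state
for _ in range(200):
    c, h_lstm = lstm_step(c, h_lstm, 0.0, params)
    h_rnn = rnn_step(h_rnn, 0.0, w_h=0.5, w_x=1.0, b=0.0)

print(f"LSTM cell state after 200 steps: {c:.4f}")
print(f"vanilla RNN state after 200 steps: {h_rnn:.3e}")
```

With these settings the cell state is carried forward almost intact, while the vanilla RNN state is squashed toward zero by the repeated weight and tanh. A trained network would learn its own gate values, so the numbers here only illustrate the mechanism.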