Questions — LSTM and Gated Recurrent Units

Question 1 Multiple Choice

Why does the vanilla RNN fail to learn long-range dependencies, and how does the LSTM cell state address this?

A. Vanilla RNNs have too few parameters to capture distant patterns; LSTMs add more weight matrices that explicitly attend to earlier time steps

B. In vanilla RNNs, gradients are multiplied by the same weight matrix at every step, causing exponential decay or explosion; the LSTM cell state uses additive updates controlled by gates, allowing gradients to flow back without repeated squashing through nonlinearities

C. Vanilla RNNs process each time step independently and discard prior context; LSTMs concatenate all past hidden states into a growing memory buffer

D. Vanilla RNNs cannot handle sequences longer than the training window; LSTMs use attention to directly access any past time step regardless of distance

Question 2 Multiple Choice

A language model must remember whether a sentence began with a question word ('Who', 'What', 'Why') in order to correctly generate a response token 200 steps later. Which architecture handles this most reliably?

A. A vanilla RNN, because it passes a hidden state forward at every time step and accumulates context continuously

B. An LSTM, because its forget gate can learn to maintain the relevant information in the cell state across 200 steps by outputting values near 1 for that memory dimension

C. A GRU, because fewer parameters reduce overfitting on the rare question-word event and improve generalization

D. A feed-forward network with a fixed context window of 200 tokens, since explicit indexing avoids gradient decay entirely

Question 3 True / False

The forget gate in an LSTM can, in principle, preserve a piece of information indefinitely across an unlimited number of time steps by learning to output values close to 1 for the corresponding cell state dimension.

T. True

F. False

Question 4 True / False

GRUs consistently outperform LSTMs on tasks requiring very long-range memory because their simpler two-gate architecture provides more efficient gradient flow.

T. True

F. False

Question 5 Short Answer

What is the fundamental architectural insight that allows LSTMs to maintain long-range dependencies, and why does the vanilla RNN fail at this?

Think about your answer, then reveal below.
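
One way to explore this question numerically is the sketch below. It is an illustration under simplifying assumptions, not the quiz's official answer: the network is reduced to a single scalar unit, the vanilla RNN's hidden activation is held at a fixed hypothetical value of 0.5, and the LSTM gates are treated as constants so that only the direct cell-state path is considered. The function names and the chosen values (w = 0.9, forget gates of 0.999 and 0.9) are made up for the example.

```python
def rnn_gradient_factor(w, h_values):
    """Backpropagated factor through a scalar tanh RNN, h_t = tanh(w * h_{t-1} + x_t).

    Each step contributes d h_t / d h_{t-1} = w * (1 - h_t**2), so the same
    weight and a tanh derivative are multiplied in at every step.
    """
    factor = 1.0
    for h in h_values:
        factor *= w * (1.0 - h * h)
    return factor


def cell_gradient_factor(forget_values):
    """Backpropagated factor along the LSTM cell-state path.

    With c_t = f_t * c_{t-1} + i_t * g_t and the gates treated as constants
    (an assumption of this sketch), d c_t / d c_{t-1} = f_t, with no extra
    squashing nonlinearity on this path.
    """
    factor = 1.0
    for f in forget_values:
        factor *= f
    return factor


steps = 200  # same distance as the 200-step example in Question 2

# Vanilla RNN: hypothetical weight 0.9 and hidden activation 0.5 at every step.
rnn = rnn_gradient_factor(w=0.9, h_values=[0.5] * steps)

# LSTM cell state: forget gate held near 1, versus a leakier gate.
lstm_open = cell_gradient_factor([0.999] * steps)
lstm_leaky = cell_gradient_factor([0.9] * steps)

print(f"vanilla RNN factor after {steps} steps: {rnn:.3e}")
print(f"LSTM cell path, f = 0.999:            {lstm_open:.3e}")
print(f"LSTM cell path, f = 0.9:              {lstm_leaky:.3e}")
```

Under these assumptions the vanilla RNN factor collapses toward zero, while the cell-state factor stays of order one only when the forget gate sits very close to 1. A real LSTM also has gradient paths through the gates themselves, which this sketch ignores.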
