Questions: LSTM and Gated Recurrent Units

5 questions to test your understanding

Question 1 Multiple Choice

Why does the vanilla RNN fail to learn long-range dependencies, and how does the LSTM cell state address this?

A. Vanilla RNNs have too few parameters to capture distant patterns; LSTMs add more weight matrices that explicitly attend to earlier time steps
B. In vanilla RNNs, gradients are multiplied by the same weight matrix at every step, causing exponential decay or explosion; the LSTM cell state uses additive updates controlled by gates, allowing gradients to flow back without repeated squashing through nonlinearities
C. Vanilla RNNs process each time step independently and discard prior context; LSTMs concatenate all past hidden states into a growing memory buffer
D. Vanilla RNNs cannot handle sequences longer than the training window; LSTMs use attention to directly access any past time step regardless of distance
Question 2 Multiple Choice

A language model must remember whether a sentence began with a question word ('Who', 'What', 'Why') in order to correctly generate a response token 200 steps later. Which architecture handles this most reliably?

A. A vanilla RNN, because it passes a hidden state forward at every time step and accumulates context continuously
B. An LSTM, because its forget gate can learn to maintain the relevant information in the cell state across 200 steps by outputting values near 1 for that memory dimension
C. A GRU, because fewer parameters reduce overfitting on the rare question-word event and improve generalization
D. A feed-forward network with a fixed context window of 200 tokens, since explicit indexing avoids gradient decay entirely
Question 3 True / False

The forget gate in an LSTM can, in principle, preserve a piece of information indefinitely across an unlimited number of time steps by learning to output values close to 1 for the corresponding cell state dimension.

T. True
F. False
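
To experiment with the statement above, here is a minimal sketch in Python. It assumes a single cell-state dimension, a forget gate held constant over time, and no new input written by the input gate, so the update reduces to c_t = f * c_{t-1}. The function names and the chosen logits are illustrative only and do not come from any library.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used for LSTM gate activations."""
    return 1.0 / (1.0 + math.exp(-x))


def retained_fraction(forget_logit: float, steps: int) -> float:
    """Fraction of an initial cell value left after `steps` updates.

    Assumption: the input gate contributes nothing, so c_t = f * c_{t-1}
    with a constant forget gate f = sigmoid(forget_logit).
    """
    f = sigmoid(forget_logit)
    return f ** steps


if __name__ == "__main__":
    # Larger pre-activation values push the gate closer to 1.
    for logit in (2.0, 5.0, 10.0):
        f = sigmoid(logit)
        kept = retained_fraction(logit, 200)
        print(f"logit={logit:>4}: f={f:.6f}, kept after 200 steps={kept:.4f}")
```

Varying the logit and the number of steps shows how sensitive retention is to how close the gate output sits to 1.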
Question 4 True / False

GRUs consistently outperform LSTMs on tasks requiring very long-range memory because their simpler two-gate architecture provides more efficient gradient flow.

TTrue
FFalse
Question 5 Short Answer

What is the fundamental architectural insight that allows LSTMs to maintain long-range dependencies, and why does the vanilla RNN fail at this?

Think about your answer, then reveal below.
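
As a numerical aid for checking your reasoning (not a model answer), the sketch below compares how a gradient shrinks along two scalar recurrences. It is a deliberate simplification: the vanilla RNN is reduced to h_t = tanh(w * h_{t-1}) with no input, and the LSTM is reduced to its cell-state path c_t = f * c_{t-1} + i * g with the forget gate held constant and the gates treated as independent of c. Both function names are hypothetical.

```python
import math


def vanilla_rnn_gradient(w: float, steps: int, h0: float = 0.5) -> float:
    """d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: each step multiplies by the weight and by tanh'.
        grad *= w * (1.0 - h * h)
    return grad


def cell_state_gradient(forget_gate: float, steps: int) -> float:
    """d c_T / d c_0 along c_t = f * c_{t-1} + i * g.

    Assumption: f is constant and the gates do not depend on c, so the
    additive term i * g drops out of the derivative.
    """
    return forget_gate ** steps


if __name__ == "__main__":
    steps = 200
    print("vanilla RNN, w=0.9:  ", vanilla_rnn_gradient(0.9, steps))
    print("cell state,  f=0.999:", cell_state_gradient(0.999, steps))
```

Trying other values of `w`, `forget_gate`, and `steps` makes the difference between the repeated weight-and-nonlinearity product and the gated additive path easier to see.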