What does the GRU's update gate accomplish that requires two separate gates in an LSTM?
A. It controls whether the hidden state is transmitted to the output layer
B. It simultaneously handles what to forget from the old state and what new information to incorporate, merging the LSTM's forget gate and input gate into one
C. It applies a nonlinear transformation to the input so the network can learn complex patterns
D. It selects which elements of the hidden state to reset to zero between sequences
Answer: B
In an LSTM, the forget gate decides how much of the cell state to erase, and the input gate decides how much new candidate information to write. These are two separate sigmoid activations, each with its own weight matrix. The GRU's update gate merges them: the fraction z of the new candidate state that is incorporated automatically fixes the fraction 1−z of the old state that is retained, since the final state is a linear interpolation of old and new. This simplification reduces the parameter count while preserving the core gating behavior.
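The coupling between "forget" and "write" can be checked numerically. The following is a minimal sketch, not a full cell: the function names and the gate values are illustrative assumptions, and the gates are supplied by hand rather than computed from learned weights.

```python
import numpy as np

def lstm_cell_update(c_prev, candidate, f, i):
    """LSTM-style update: independent forget gate f and input gate i."""
    return f * c_prev + i * candidate

def gru_state_update(h_prev, candidate, z):
    """GRU-style update: one gate z sets both the write and the keep fraction."""
    return (1.0 - z) * h_prev + z * candidate

h_prev = np.array([1.0, -2.0, 0.5])
candidate = np.array([0.2, 0.4, -0.6])
z = np.array([0.0, 0.5, 1.0])  # hand-picked gate values, not learned

gru_out = gru_state_update(h_prev, candidate, z)

# The GRU update is an LSTM-style update with tied gates: f = 1 - z, i = z.
tied_out = lstm_cell_update(h_prev, candidate, f=1.0 - z, i=z)

print(gru_out)   # element 0 keeps the old state, element 2 takes the candidate
print(np.allclose(gru_out, tied_out))
```

With z = 0 the old value is kept unchanged, with z = 1 it is fully replaced, and intermediate values blend the two. An LSTM can choose f and i independently, which is the extra freedom the GRU gives up.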
Question 2 Multiple Choice
A team is training a model on sequences of moderate length with a limited dataset and tight computational budget. Which consideration most favors using a GRU over an LSTM?
A. GRUs are guaranteed to outperform LSTMs on all natural language tasks
B. GRUs are always faster to train than LSTMs regardless of sequence length or hardware
C. GRUs have fewer parameters than an equivalently sized LSTM, reducing overfitting risk on small datasets and lowering training cost
D. GRUs handle the vanishing gradient problem more effectively than LSTMs because they have fewer gates
Answer: C
A GRU has three blocks of weights (update gate, reset gate, candidate) where an LSTM has four (input, forget, and output gates plus the candidate), so it has roughly 75% of an LSTM's parameter count for the same hidden size. That typically means less memory, cheaper training steps, and less risk of overfitting on limited data, while performance on many tasks is comparable to LSTMs. GRUs do not universally outperform LSTMs (option A is false), and any speed advantage depends on sequence length and hardware (option B overstates it). Both architectures address vanishing gradients via gating, so option D is also wrong: the GRU's advantage is computational efficiency, not fundamentally better gradient flow.
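The 75% figure follows from counting weight blocks. Here is a rough sketch of that arithmetic, assuming a single layer, one bias vector per block, and example sizes chosen only for illustration. Some libraries store two bias vectors per block, which changes the totals but not the ratio.

```python
def gated_rnn_params(input_size, hidden_size, num_blocks):
    """Parameters in one recurrent layer with num_blocks gate/candidate blocks.

    Each block has a weight matrix over [h_prev, x] plus one bias vector
    (an assumption; bias conventions differ between libraries).
    """
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block

def gru_params(input_size, hidden_size):
    # update gate, reset gate, candidate
    return gated_rnn_params(input_size, hidden_size, num_blocks=3)

def lstm_params(input_size, hidden_size):
    # input gate, forget gate, output gate, candidate
    return gated_rnn_params(input_size, hidden_size, num_blocks=4)

input_size, hidden_size = 128, 256  # example sizes, chosen arbitrarily
n_gru = gru_params(input_size, hidden_size)
n_lstm = lstm_params(input_size, hidden_size)
ratio = n_gru / n_lstm

print(n_gru, n_lstm, ratio)
```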
Question 3 True / False
Like LSTMs, GRUs maintain two separate memory vectors: a cell state for long-term memory and a hidden state for short-term context.
T. True
F. False
Answer: False
This is a key architectural difference. LSTMs have two vectors: the cell state c (which flows with relatively little modification, serving as the long-term memory highway) and the hidden state h. GRUs eliminate the cell state entirely, maintaining only a single hidden state h. The update gate's linear interpolation formula provides the gradient flow benefit of the LSTM cell state without a separate memory vector. The lower parameter count comes from the accompanying reduction in gates: a GRU has three weight blocks (update, reset, candidate) where an LSTM has four.
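The difference in carried state shows up directly in the step functions. The following NumPy sketch uses the standard gate equations with hypothetical parameter names (`Wz`, `Wr`, `Wh`, and so on) and random, untrained weights. It is meant only to show what each cell passes from one time step to the next.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step: a single state vector h goes in, a single one comes out."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(p["Wz"] @ hx + p["bz"])            # update gate
    r = sigmoid(p["Wr"] @ hx + p["br"])            # reset gate
    h_tilde = np.tanh(p["Wh"] @ np.concatenate([r * h_prev, x]) + p["bh"])
    return (1.0 - z) * h_prev + z * h_tilde

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: both the hidden state h and the cell state c are carried."""
    hx = np.concatenate([h_prev, x])
    f = sigmoid(p["Wf"] @ hx + p["bf"])            # forget gate
    i = sigmoid(p["Wi"] @ hx + p["bi"])            # input gate
    o = sigmoid(p["Wo"] @ hx + p["bo"])            # output gate
    g = np.tanh(p["Wg"] @ hx + p["bg"])            # candidate
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

def init_params(block_names, input_size, hidden_size, rng):
    """Random weights for illustration only; one W and one b per block."""
    p = {}
    for name in block_names:
        p["W" + name] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
        p["b" + name] = np.zeros(hidden_size)
    return p

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
gru_p = init_params("zrh", input_size, hidden_size, rng)    # three blocks
lstm_p = init_params("fiog", input_size, hidden_size, rng)  # four blocks

x = rng.normal(size=input_size)
h0 = np.zeros(hidden_size)
c0 = np.zeros(hidden_size)

h_gru = gru_step(x, h0, gru_p)
h_lstm, c_lstm = lstm_step(x, h0, c0, lstm_p)
```

The GRU parameter dictionary holds three weight blocks and the LSTM one holds four, which matches the parameter-count difference described above.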
Question 4 True / False
The reset gate in a GRU controls how much of the previous hidden state is used when computing the candidate new hidden state.
T. True
F. False
Answer: True
The reset gate r gates how much h_{t-1} contributes to the candidate state h̃ = tanh(W·[r⊙h_{t-1}, x_t]). When r ≈ 0, the candidate is computed almost entirely from the current input — useful at sequence boundaries or after rare events. When r ≈ 1, the candidate blends history and current input like a standard RNN. This allows the GRU to selectively discard stale history while retaining access to it when relevant.
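The two extremes of the reset gate can be illustrated with a small sketch. It assumes the candidate formula above plus a bias term `b`, uses random untrained weights, and sets r by hand instead of computing it from a sigmoid. The helper name `candidate_state` is hypothetical.

```python
import numpy as np

def candidate_state(x, h_prev, r, W, b):
    """Candidate h~ = tanh(W [r * h_prev, x] + b), with r supplied directly."""
    return np.tanh(W @ np.concatenate([r * h_prev, x]) + b)

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
b = np.zeros(hidden_size)

x = rng.normal(size=input_size)
h_a = rng.normal(size=hidden_size)   # two different histories
h_b = rng.normal(size=hidden_size)

r_closed = np.zeros(hidden_size)     # r = 0: history is masked out
r_open = np.ones(hidden_size)        # r = 1: history passes through

closed_a = candidate_state(x, h_a, r_closed, W, b)
closed_b = candidate_state(x, h_b, r_closed, W, b)
open_a = candidate_state(x, h_a, r_open, W, b)
open_b = candidate_state(x, h_b, r_open, W, b)

# With r = 0 the candidate depends only on x, so both histories agree.
print(np.allclose(closed_a, closed_b))
# With r = 1 the candidate depends on the history, so they generally differ.
print(np.allclose(open_a, open_b))
```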
Question 5 Short Answer
How does the GRU's update gate help mitigate the vanishing gradient problem during backpropagation through time?
Model answer: The update gate creates a near-linear pathway for gradients to flow backward. The hidden state update h_t = (1−z)⊙h_{t-1} + z⊙h̃ is a linear interpolation, so part of the gradient reaches h_{t-1} directly through the (1−z) term without passing through the recurrent weight matrix or the tanh derivative. Along this path the gradient is scaled by (1−z) at each step; when the network learns z ≈ 0 for a unit, that factor stays close to 1 and the signal survives many time steps. This additive path is analogous to the LSTM cell state highway and allows both architectures to learn long-range dependencies that vanilla RNNs struggle with.
The vanishing gradient problem in vanilla RNNs arises because backpropagation through time multiplies the gradient at every step by a Jacobian built from the same recurrent weight matrix and the tanh derivative. When these factors have magnitude less than 1, the product shrinks exponentially. The linear interpolation in the GRU adds a path that bypasses this chain, so the gradient can remain meaningful over many time steps as long as the update gate keeps (1−z) close to 1.
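A scalar toy comparison makes the scale of the effect concrete. This is a simplified sketch under stated assumptions: the RNN is one-dimensional with a hand-picked recurrent weight of 0.5, and the GRU figure counts only the direct (1−z) path with a fixed gate value, ignoring the extra gradient paths through the gates and the candidate.

```python
import numpy as np

def rnn_gradient(w, xs, h0=0.0):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in xs:
        h = np.tanh(w * h + x)
        grad *= w * (1.0 - h ** 2)   # chain rule: tanh derivative times weight
    return grad

def gru_direct_gradient(zs):
    """Gradient along the direct path only: the product of (1 - z_t)."""
    return float(np.prod(1.0 - np.asarray(zs)))

T = 100
rng = np.random.default_rng(0)
xs = rng.normal(size=T)

rnn_grad = rnn_gradient(w=0.5, xs=xs)      # shrinks at least as fast as 0.5**T
gru_grad = gru_direct_gradient([0.01] * T) # equals 0.99**T

print(abs(rnn_grad), gru_grad)
```

The gate value 0.01 is a best case chosen for illustration. If the update gate sits near 1 the direct path also decays quickly, which is the intended behavior when the unit is supposed to overwrite its state.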