An RNN is trained on sequences of length 20, but at inference time it needs to process sequences of length 100. Why can a basic RNN architecture handle this without modification, unlike a standard feedforward network?
A. RNNs automatically resize their weight matrices to match sequence length at inference time
B. RNNs use weight sharing (the same weight matrices process every time step), so the architecture is independent of sequence length
C. RNNs store all sequence elements in a fixed-size lookup table, allowing variable input sizes
D. RNNs cannot handle sequences longer than those seen during training; the question assumes a capability RNNs lack
Weight sharing is the key architectural feature. The same matrices W_h and W_x are used at every time step regardless of position. There are no separate parameters for step 1, step 2, etc. This means the network applies the same learned transformation at every position, making it naturally applicable to any sequence length. A feedforward network, by contrast, has a fixed input layer size and cannot process inputs of different lengths without architecture changes.
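The weight-sharing argument above can be illustrated with a minimal sketch in plain Python. The function name `rnn_forward`, the tanh activation, the bias `b`, and the sizes are assumptions chosen for illustration; only the names W_x and W_h come from the explanation. The point is that one set of parameters is reused for a 20-step and a 100-step sequence.

```python
import math
import random

def rnn_forward(xs, W_x, W_h, b, h0):
    """Run a basic tanh RNN over a sequence of any length; return the final hidden state."""
    h = h0
    for x in xs:  # the same W_x, W_h, and b are reused at every time step
        h = [
            math.tanh(
                sum(W_x[i][j] * x[j] for j in range(len(x)))
                + sum(W_h[i][k] * h[k] for k in range(len(h)))
                + b[i]
            )
            for i in range(len(h))
        ]
    return h

random.seed(0)
input_size, hidden_size = 3, 4  # illustrative sizes, not from the quiz
W_x = [[random.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
W_h = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
b = [0.0] * hidden_size
h0 = [0.0] * hidden_size

def make_sequence(length):
    """Build a random sequence of `length` input vectors."""
    return [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(length)]

# Identical parameters handle both lengths; nothing is resized.
h_short = rnn_forward(make_sequence(20), W_x, W_h, b, h0)
h_long = rnn_forward(make_sequence(100), W_x, W_h, b, h0)
print(len(h_short), len(h_long))
```

The parameter shapes depend only on the input and hidden sizes, never on the number of time steps, which is why the loop can run for any length.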
Question 2 Multiple Choice
During backpropagation through time on a 50-step sequence, the gradient of the loss with respect to the initial hidden state involves a chain rule product of 50 Jacobian matrices. What is the most likely problem this creates, and why?
A. Memory overflow, because storing 50 intermediate hidden states requires too much RAM
B. Vanishing gradients: if the eigenvalues of the recurrent weight matrix all have magnitude less than 1, repeated multiplication drives gradient magnitudes exponentially toward zero, preventing learning of long-range dependencies
C. Computational expense is the primary issue, not gradient flow; the math works correctly but slowly
D. Vanishing gradients only affect the output layer; internal layers receive normal gradient signals
Vanishing gradients are the central training problem for RNNs. The chain rule for BPTT requires multiplying many Jacobian matrices together, one per time step. Each Jacobian is the recurrent weight matrix scaled by the derivative of the activation function, so if the spectral radius of the recurrent weight matrix is less than 1 and the activation derivative is at most 1 (as with tanh), these products tend to decay exponentially. Gradients reaching the early time steps become negligibly small, and those early steps receive almost no useful training signal. This means the RNN cannot learn that something at step 1 matters for a prediction at step 50. This motivated LSTM and GRU architectures.
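The exponential decay described above can be checked numerically in the simplest possible setting. The sketch below assumes a scalar RNN, h_t = tanh(w * h_{t-1} + x_t), so each Jacobian reduces to the single number w * (1 - h_t^2) and the spectral radius is just |w|. The function name, the constant inputs, and the choice of w values are illustrative assumptions.

```python
import math

def grad_wrt_initial_state(w, xs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in xs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # one Jacobian factor per time step
    return grad

xs = [0.1] * 50  # a 50-step sequence of constant inputs (assumed for illustration)
for w in (0.5, 0.9):
    print(w, abs(grad_wrt_initial_state(w, xs)))
```

Because every factor has magnitude below |w|, the gradient magnitude is bounded by |w| raised to the 50th power, which is already far below anything an optimizer can use when w = 0.5.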
Question 3 True / False
RNNs can theoretically learn to depend on any arbitrarily distant past input in a sequence because the hidden state carries most prior information forward indefinitely.
T. True
F. False
Answer: False
In theory, the hidden state carries information forward through the entire sequence. In practice, the vanishing gradient problem prevents training from learning long-range dependencies. The hidden state at step t is influenced by past inputs, but the training signal is too weak to learn that a dependency exists across many steps. The state can carry information, but gradients don't flow back far enough to teach the network which long-range information to retain. This is why LSTM's learned gates are needed — they provide gradient pathways that resist vanishing.
Question 4 True / False
Gradient clipping is a complete solution to the gradient instability problem in RNNs because it prevents both vanishing and exploding gradients.
T. True
F. False
Answer: False
Gradient clipping addresses exploding gradients (by rescaling gradients when their norm exceeds a threshold), but does nothing for vanishing gradients. Vanishing gradients are the more fundamental problem for learning long-range dependencies — the gradient simply isn't there to clip. Clipping helps stabilize training but the vanishing problem requires architectural solutions like LSTM or GRU, which provide learned gates that maintain gradient flow over long sequences.
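A minimal sketch of norm-based clipping shows why it helps with one problem and not the other. The function name `clip_by_global_norm`, the flat-list gradient representation, and the threshold of 5.0 are assumptions for illustration, not any particular library's API.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # small (including vanished) gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

exploding = [300.0, -400.0]  # L2 norm 500, well above the threshold
vanished = [3e-12, -4e-12]   # L2 norm 5e-12, far below the threshold

print(clip_by_global_norm(exploding, 5.0))  # rescaled down to norm 5
print(clip_by_global_norm(vanished, 5.0))   # returned as-is: clipping never scales up
```

Clipping only ever shrinks a gradient, so a gradient that has already vanished is left exactly as small as it was.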
Question 5 Short Answer
Explain why an RNN's hidden state is both its greatest strength and the source of its main training challenge.
Model answer: The hidden state enables sequence modeling: it accumulates information from prior steps and passes it forward, giving the network memory across the sequence. This is what makes RNNs suitable for variable-length inputs where order matters. But training requires gradients to flow backward through every time step via backpropagation through time (BPTT). Because the hidden state is computed by repeatedly applying the same recurrent weight matrix, the backward pass involves multiplying many Jacobians together. This causes gradients to vanish or explode exponentially with sequence length, making it very difficult to learn which early inputs matter for a late prediction.
The same mechanism that gives RNNs their memory — the recurrent loop — is what makes training hard. The hidden state is a compressed representation that information flows through, but gradients must flow backward through the same bottleneck. Gated architectures (LSTM, GRU) were designed specifically to decouple the forward memory from the backward gradient flow, using gates to create pathways where gradients can pass without repeated matrix multiplication.
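The contrast between the two backward pathways can be sketched with scalar per-step factors. This is a deliberate simplification: it assumes the direct path through an LSTM cell state contributes one forget-gate value per step (here fixed at 0.99), ignores the indirect dependence of the gates on earlier hidden states, and assumes a per-step factor of 0.5 for the basic RNN. All numbers and names are illustrative.

```python
def path_gradient(factors):
    """Multiply the per-step factors along one backward pathway."""
    grad = 1.0
    for f in factors:
        grad *= f
    return grad

steps = 50
# Basic RNN: each step multiplies by |w * tanh'(.)|, assumed to be 0.5 here.
vanilla = path_gradient([0.5] * steps)
# LSTM cell-state path: each step multiplies by the forget gate, assumed to be 0.99 here.
cell_path = path_gradient([0.99] * steps)

print(vanilla, cell_path)
```

Under these assumptions the gated path still carries a gradient of useful size after 50 steps, while the basic recurrent path has shrunk by many orders of magnitude. The gates are learned, so the network can hold the forget gate near 1 for information worth keeping.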