A seq2seq model translates short sentences well but quality degrades sharply on paragraphs. What architectural feature most likely causes this?
A. The decoder LSTM cannot process more than one output token at a time
B. The encoder compresses the entire input into a fixed-size vector, losing information for long inputs
C. Beam search becomes computationally intractable for long sequences
D. LSTMs cannot maintain hidden state for more than 50 steps
The fixed-size context vector is the architectural bottleneck. Regardless of input length, the encoder must compress everything into one dense vector — a short sentence and a long paragraph must both fit into the same dimensionality. For long inputs, details inevitably get lost. This is the exact problem attention mechanisms solve by letting the decoder access all encoder hidden states, not just the final one.
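The bottleneck can be seen in a minimal sketch of a toy recurrent encoder. Everything here is an assumption for illustration: the `encode` function, the random untrained weights, the hidden size of 4, and the integer token ids are hypothetical and do not come from any particular library or trained model.

```python
import math
import random

def encode(token_ids, hidden_size=4, seed=0):
    """Toy recurrent encoder. Returns (all_hidden_states, final_hidden_state)."""
    rng = random.Random(seed)
    # Hypothetical fixed random weights, standing in for trained parameters.
    w_in = [rng.uniform(-1, 1) for _ in range(hidden_size)]
    w_rec = [[rng.uniform(-1, 1) for _ in range(hidden_size)]
             for _ in range(hidden_size)]
    h = [0.0] * hidden_size
    states = []
    for tok in token_ids:
        x = tok / 10.0  # crude scalar "embedding" of the token id
        # The new state mixes the current input with the previous state.
        h = [math.tanh(w_in[i] * x + sum(w_rec[i][j] * h[j]
                                         for j in range(hidden_size)))
             for i in range(hidden_size)]
        states.append(h)
    return states, h

short_states, short_final = encode([3, 1, 4])
long_states, long_final = encode(list(range(200)))

print(len(short_final), len(long_final))    # 4 4
print(len(short_states), len(long_states))  # 3 200
```

The final state has 4 numbers whether the input has 3 tokens or 200, while the list of per-step states grows with the input. A decoder without attention sees only the former.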
Question 2 Multiple Choice
During decoding, beam search with width k=5 is used instead of greedy decoding. Which best describes what beam search guarantees?
A. It finds the globally optimal output sequence with probability 1
B. It finds an output sequence at least as good as greedy decoding, but the global optimum is not guaranteed
C. It samples k diverse outputs randomly, improving expected quality
D. It guarantees the highest-probability individual token at every step
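Beam search maintains the top-k partial sequences at each step and selects the highest-scoring complete sequence. Because it considers more candidates, it usually finds a sequence that scores at least as well as greedy decoding (k=1), but it does not exhaustively search all possible outputs. The global optimum can still be missed if one of its prefixes was pruned from the beam. Strictly speaking, even the comparison with greedy decoding is a strong tendency rather than a formal guarantee, since the greedy path can itself be pruned. Beam search is a practical approximation, not an exact algorithm.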
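The procedure can be sketched on a tiny example. The `TOY_MODEL` table and its probabilities below are hypothetical, chosen so that the locally best first token leads to a worse overall sequence. A real decoder would query a trained model instead of a lookup table and would also handle end-of-sequence tokens.

```python
import math

# Hypothetical toy model: maps a prefix (tuple of tokens) to next-token probabilities.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"a": 0.55, "b": 0.45},
    ("b",): {"a": 0.9, "b": 0.1},
}

def beam_search(model, k, length):
    """Return (sequence, log_prob) of the best hypothesis found with beam width k."""
    beam = [((), 0.0)]  # each entry: (prefix, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for prefix, score in beam:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the k highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:k]
    return beam[0]

greedy_seq, greedy_score = beam_search(TOY_MODEL, k=1, length=2)  # k=1 is greedy decoding
beam_seq, beam_score = beam_search(TOY_MODEL, k=2, length=2)

print(greedy_seq, round(math.exp(greedy_score), 2))  # ('a', 'a') 0.33
print(beam_seq, round(math.exp(beam_score), 2))      # ('b', 'a') 0.36
```

Greedy decoding commits to "a" because it is the most likely first token, and ends with probability 0.33. With width 2, the prefix "b" survives the first step and leads to the higher-probability sequence. On longer sequences or larger vocabularies, the best sequence can still fall outside the beam.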
Question 3 True / False
In a seq2seq model without attention, the decoder can primarily use information about the first few input tokens because LSTM hidden states decay over time.
True
False
Answer: False
This conflates two issues. In a standard seq2seq model, the decoder uses the encoder's final hidden state, which ideally summarizes the entire input, not just the early tokens. If anything, a recurrent encoder tends to retain recent tokens better than early ones. The core problem is that the final hidden state is a fixed-size vector that must encode everything, and very long sequences overload this fixed capacity. Attention addresses this by letting the decoder actively query specific input positions at each generation step, rather than relying solely on one summary vector.
Question 4 True / False
With attention, the decoder can place different amounts of focus on different input positions at each generation step, rather than being restricted to a single fixed context vector.
True
False
Answer: True
This is the defining property of attention. At each decoding step, the attention mechanism computes a weighted sum over all encoder hidden states, where the weights come from a learned compatibility score between the current decoder state and each encoder state, normalized to sum to 1. The resulting context vector is different at each step: when generating a verb in translation, the model tends to attend to the source verb; when generating a noun, to the source noun. This dynamic access is what overcomes the fixed-vector bottleneck.
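The weighted-sum computation can be sketched in a few lines. This version assumes plain dot-product scoring followed by a softmax, which is one common choice. Other attention variants score compatibility with a small learned network instead. The `attention` function and the example vectors are hypothetical.

```python
import math

def attention(decoder_state, encoder_states):
    """Dot-product attention. Returns (weights, context_vector)."""
    # Compatibility score between the decoder state and each encoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # Softmax turns the scores into positive weights that sum to 1.
    max_score = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Context vector: weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two different decoder states, as at two different generation steps.
w1, c1 = attention([2.0, 0.0], encoder_states)
w2, c2 = attention([0.0, 2.0], encoder_states)

print([round(w, 3) for w in w1])
print([round(w, 3) for w in w2])
```

The same encoder states yield different weights and a different context vector for each decoder state: the first query favors position 0 over position 1, and the second does the reverse.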
Question 5 Short Answer
Why does the information bottleneck in a standard encoder-decoder model become a problem for long sequences, and how does attention address it?
Model answer: The encoder must compress the entire input into a single fixed-size vector regardless of input length. For short inputs this works well, but long inputs contain more information than a fixed-size vector can represent — early content gets overwritten or diluted. Attention removes the bottleneck by keeping all encoder hidden states available and letting the decoder dynamically query the most relevant ones at each step, forming a different weighted combination depending on what is being generated.
The bottleneck is a capacity problem: a finite-dimensional vector is asked to carry an ever-growing amount of information as the input gets longer. Attention replaces the fixed summary with a learned soft lookup: at each decoding step, it computes how relevant each input position is to the current output, then forms a context vector as the weighted average of the encoder states. The encoder still processes the whole input, but no encoder state is discarded, so every position remains accessible to the decoder.