Questions: Transformer Architecture

5 questions to test your understanding

Question 1 Multiple Choice

Why did the transformer architecture enable scaling language models to hundreds of billions of parameters when LSTM-based architectures could not practically reach that scale?

A. Transformers have fewer parameters per layer than LSTMs, allowing deeper networks at equivalent computational cost
B. All token relationships in a transformer are computed as matrix multiplications that execute in parallel on GPUs, eliminating the sequential bottleneck that forced RNNs to process one token at a time during training
C. Transformers use residual connections which prevent vanishing gradients, while LSTMs lack this mechanism entirely
D. Transformers can handle variable-length inputs natively, while RNNs require fixed-length padding that wastes computation
Question 2 Multiple Choice

If positional encodings were completely removed from a transformer, with all other components unchanged, what would happen to the model's behavior?

A. The model would fail entirely, since attention requires positional offsets to compute similarity scores
B. The model would treat any permutation of the same tokens as an identical input, losing all sensitivity to word order
C. Only the cross-attention layers would be affected; self-attention layers would still capture order through learned weights
D. The model would effectively become a bag-of-words model with no sequential structure at all
Question 3 True / False

In a well-trained transformer, different attention heads within the same multi-head attention layer can specialize to capture different types of relationships simultaneously.

T. True
F. False
Question 4 True / False

During transformer training, all sequence positions can be processed simultaneously because self-attention does not maintain or read from any sequential hidden state.

T. True
F. False
Question 5 Short Answer

Explain why self-attention in a transformer requires positional encodings to be explicitly added, whereas an LSTM processes order implicitly without any such mechanism.

Think about your answer, then reveal below.
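
One way to check your reasoning is to run a small experiment. The following is a minimal sketch, not a reference implementation: it assumes a single attention head with no masking and no positional encodings, uses random weights, and relies only on the Python standard library. All names (`self_attention`, `rand_matrix`, `perm`) are hypothetical and chosen for illustration. It shuffles the input rows and compares the output with the unshuffled run.

```python
import math
import random

def matmul(a, b):
    # Plain nested-list matrix product: (n x m) times (m x p).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product attention, no mask, no positional signal.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d = 4                                   # embedding width (arbitrary choice)
x = rand_matrix(3, d, rng)              # three token embeddings, no positions added
w_q, w_k, w_v = (rand_matrix(d, d, rng) for _ in range(3))

perm = [2, 0, 1]                        # reorder the tokens
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention([x[i] for i in perm], w_q, w_k, w_v)

# Row i of the shuffled run should match row perm[i] of the original run.
max_diff = max(
    abs(a - b)
    for i, p in enumerate(perm)
    for a, b in zip(out_perm[i], out[p])
)
print(max_diff < 1e-9)
```

In this sketch each token's output vector follows the token to its new position, up to floating-point rounding: nothing in the computation depends on where a token sits in the sequence. An LSTM, by contrast, consumes tokens one step at a time, so the order of arrival shapes its hidden state directly.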