Questions — Transformer Architecture

Question 1 Multiple Choice

Why did the transformer architecture enable scaling language models to hundreds of billions of parameters when LSTM-based architectures could not practically reach that scale?

A. Transformers have fewer parameters per layer than LSTMs, allowing deeper networks at equivalent computational cost

B. All token relationships in a transformer are computed as matrix multiplications that execute in parallel on GPUs, eliminating the sequential bottleneck that forced RNNs to process one token at a time during training

C. Transformers use residual connections, which prevent vanishing gradients, while LSTMs lack this mechanism entirely

D. Transformers can handle variable-length inputs natively, while RNNs require fixed-length padding that wastes computation

Question 2 Multiple Choice

If positional encodings were completely removed from a transformer — with all other components unchanged — what would happen to the model's behavior?

A. The model would fail entirely, since attention requires positional offsets to compute similarity scores

B. The model would treat any permutation of the same tokens as an identical input, losing all sensitivity to word order

C. Only the cross-attention layers would be affected; self-attention layers would still capture order through learned weights

D. The model would effectively become a bag-of-words model with no sequential structure at all

Question 3 True / False

In a well-trained transformer, different attention heads within the same multi-head attention layer can specialize to capture different types of relationships simultaneously.

T. True

F. False

Question 4 True / False

During transformer training, all sequence positions can be processed simultaneously because self-attention does not maintain or read from any sequential hidden state.

T. True

F. False

Question 5 Short Answer

Explain why self-attention in a transformer requires positional encodings to be explicitly added, whereas an LSTM processes order implicitly without any such mechanism.

