Questions: Self-Attention and Multi-Head Attention

5 questions to test your understanding

Question 1 Multiple Choice

Suppose you remove the positional encodings from a Transformer model and train it on a sentence classification task. What is the most fundamental consequence?

A. The model can no longer compute attention scores because the Q, K, V projections depend on position
B. The model treats every permutation of the same words as identical, losing all sensitivity to word order
C. Multi-head attention stops working because heads require positional information to specialize
D. The model attends only to adjacent tokens, losing the ability to model long-range dependencies
Question 2 Multiple Choice

Why are the raw dot-product attention scores divided by √dₖ before applying the softmax in self-attention?

A. To normalize scores to the range [0, 1] before softmax can be applied
B. To prevent very large dot products from pushing the softmax into a near-zero-gradient region, which would slow training
C. To ensure that the output of each attention head has the same variance as the input
D. To make attention scores independent of the model dimension, allowing the same architecture to work at any scale
Question 3 True / False

Self-attention inherently captures the order of tokens in a sequence, which is why Transformers can model word order without needing positional encodings.

T. True
F. False
Question 4 True / False

In multi-head attention, different attention heads can specialize in capturing different types of relationships — such as syntactic dependencies and coreference — without any explicit supervision about which head should learn which pattern.

T. True
F. False
Question 5 Short Answer

Why is self-attention described as having O(n²) computational cost with respect to sequence length, and what does this imply for very long sequences?

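For readers who want to check their reasoning on Questions 1, 3, and 5 after answering, here is a minimal sketch that is not part of the original quiz. It assumes a single attention head, random illustrative weights, no positional encodings, and mean pooling as a stand-in for the input to a sentence classifier. All function and variable names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention, no positional encodings."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    # One score per (query token, key token) pair, scaled by sqrt(d_k).
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


def mean_pool(rows):
    """Average the token vectors into one sentence vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d = 5, 4                      # sequence length and model dimension (illustrative)
x = rand_matrix(rng, n, d)       # token embeddings with no position information
wq, wk, wv = (rand_matrix(rng, d, d) for _ in range(3))

out, weights = self_attention(x, wq, wk, wv)

perm = [3, 0, 4, 1, 2]           # shuffle the token order
x_perm = [x[i] for i in perm]
out_perm, _ = self_attention(x_perm, wq, wk, wv)

# Does each output row simply move with its token?
rows_follow = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, p in enumerate(perm)
    for a, b in zip(out_perm[i], out[p])
)
# Is the pooled sentence vector the same for both orderings?
pooled_same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(mean_pool(out), mean_pool(out_perm))
)
# How many attention scores were computed for n tokens?
num_scores = len(weights) * len(weights[0])

print(rows_follow, pooled_same, num_scores)
```

Changing n and rerunning shows how the number of attention scores grows with sequence length. Adding a position-dependent vector to each row of x before calling self_attention is one way to see how the permutation check changes.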