Questions — Self-Attention and Multi-Head Attention

Question 1 Multiple Choice

Suppose you remove the positional encodings from a Transformer model and train it on a sentence classification task. What is the most fundamental consequence?

A. The model can no longer compute attention scores because the Q, K, V projections depend on position

B. The model treats every permutation of the same words as identical, losing all sensitivity to word order

C. Multi-head attention stops working because heads require positional information to specialize

D. The model attends only to adjacent tokens, losing the ability to model long-range dependencies
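One way to explore this question empirically is the sketch below: a single-head self-attention layer in plain Python with no positional encodings, followed by mean pooling as a stand-in for a classification head. The random weights and embeddings, the choice of mean pooling, and all function names are illustrative assumptions, not part of the question. It runs the layer on a sequence and on a shuffled copy of it, then prints the largest difference between the two pooled outputs.

```python
import math
import random


def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention, no positional encodings."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


def mean_pool(rows):
    # Hypothetical pooling step standing in for a sentence classifier's input.
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d = 4
Wq, Wk, Wv = (rand_matrix(rng, d, d) for _ in range(3))
X = rand_matrix(rng, 5, d)            # 5 made-up token embeddings
X_perm = [X[i] for i in [3, 0, 4, 1, 2]]  # same tokens, different order

pooled = mean_pool(self_attention(X, Wq, Wk, Wv))
pooled_perm = mean_pool(self_attention(X_perm, Wq, Wk, Wv))
max_diff = max(abs(a - b) for a, b in zip(pooled, pooled_perm))
print("largest difference between pooled outputs:", max_diff)
```

Compare the printed difference with what each answer option would predict.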

Question 2 Multiple Choice

Why are the raw dot-product attention scores divided by √dₖ before applying the softmax in self-attention?

A. To normalize scores to the range [0, 1] before softmax can be applied

B. To prevent very large dot products from pushing the softmax into a near-zero-gradient region, which would slow training

C. To ensure that the output of each attention head has the same variance as the input

D. To make attention scores independent of the model dimension, allowing the same architecture to work at any scale
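To reason about this question numerically, the following sketch compares softmax weights computed from raw dot products with those computed from dot products divided by √dₖ. It assumes query and key components drawn independently from a standard normal distribution, in which case a dot product over dₖ components has variance dₖ. The dimension, number of keys, and variable names are illustrative choices, not taken from the question.

```python
import math
import random


def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
d_k = 512      # assumed key dimension
num_keys = 8   # assumed number of tokens attended over

q = [rng.gauss(0, 1) for _ in range(d_k)]
keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(num_keys)]

raw = [sum(a * b for a, b in zip(q, k)) for k in keys]
scaled = [s / math.sqrt(d_k) for s in raw]

w_raw = softmax(raw)
w_scaled = softmax(scaled)

print("largest weight, unscaled scores:", max(w_raw))
print("largest weight, scaled scores:  ", max(w_scaled))
```

Try several values of `d_k` and note how concentrated each weight distribution is.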

Question 3 True / False

Self-attention inherently captures the order of tokens in a sequence, which is why Transformers can model word order without needing positional encodings.

T. True

F. False

Question 4 True / False

In multi-head attention, different attention heads can specialize in capturing different types of relationships — such as syntactic dependencies and coreference — without any explicit supervision about which head should learn which pattern.

T. True

F. False

Question 5 Short Answer

Why is self-attention described as having O(n²) computational cost with respect to sequence length, and what does this imply for very long sequences?
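As an aid for working out an answer, this small sketch counts the query-key scores in one head's attention matrix for a few sequence lengths and estimates the memory needed to hold them. The function names and the assumption of 4 bytes per score (32-bit floats) are illustrative, and the estimate ignores batch size, number of heads, and any memory-saving attention variants.

```python
def attention_score_entries(n: int) -> int:
    """Number of query-key scores in one head's n x n attention matrix."""
    return n * n


def score_matrix_bytes(n: int, bytes_per_score: int = 4) -> int:
    """Rough memory for one score matrix, assuming 32-bit floats by default."""
    return attention_score_entries(n) * bytes_per_score


for n in (512, 1024, 2048, 4096):
    mib = score_matrix_bytes(n) / (1024 * 1024)
    print(f"n={n:5d}  scores={attention_score_entries(n):10d}  ~{mib:.0f} MiB per head")
```

Note how the counts change each time the sequence length doubles.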

