5 questions to test your understanding
Suppose you remove the positional encodings from a Transformer model and train it on a sentence classification task. What is the most fundamental consequence?
Why are the raw dot-product attention scores divided by √dₖ before applying the softmax in self-attention?
True or false: Self-attention inherently captures the order of tokens in a sequence, which is why Transformers can model word order without needing positional encodings.
True or false: In multi-head attention, different attention heads can specialize in capturing different types of relationships, such as syntactic dependencies and coreference, without any explicit supervision about which head should learn which pattern.
Why is self-attention described as having O(n²) computational cost with respect to sequence length, and what does this imply for very long sequences?
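After attempting the questions, you can check several of the underlying claims empirically. The following is a minimal sketch in standard-library Python, not a reference implementation. It makes simplifying assumptions that are not part of the questions: a single head, identity projections (Q = K = V = X), random Gaussian inputs, and mean pooling as a stand-in for a classification readout. All function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head self-attention with identity projections (assumption: Q = K = V = X)."""
    n, d_k = len(X), len(X[0])
    # n x n score matrix: every token is compared with every token (the O(n^2) term)
    scores = [[dot(q, k) / math.sqrt(d_k) for k in X] for q in X]
    weights = [softmax(row) for row in scores]
    out = [[sum(w * v[j] for w, v in zip(w_row, X)) for j in range(d_k)]
           for w_row in weights]
    return out, weights

def mean_pool(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def close(u, v):
    return all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(u, v))

def dot_product_variance(d_k, trials, rng):
    """Empirical variance of q.k for q, k with independent N(0, 1) entries."""
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        dots.append(dot(q, k))
    mean = sum(dots) / trials
    return sum((d - mean) ** 2 for d in dots) / trials

rng = random.Random(0)
n, d_k = 6, 64
X = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(n)]  # no positional encodings

out, weights = self_attention(X)

# Shuffle the token order and run attention again.
perm = list(range(n))
rng.shuffle(perm)
out_perm, _ = self_attention([X[i] for i in perm])

# Without positional information the outputs are only reordered, never changed.
equivariant = all(close(out_perm[i], out[perm[i]]) for i in range(n))
# An order-insensitive readout therefore cannot tell the two orderings apart.
pooled_same = close(mean_pool(out), mean_pool(out_perm))

# Raw dot products have variance close to d_k; dividing by sqrt(d_k) brings it close to 1.
raw_var = dot_product_variance(d_k, trials=2000, rng=rng)
scaled_var = raw_var / d_k  # Var(s / sqrt(d_k)) = Var(s) / d_k

print("permutation-equivariant:", equivariant)
print("pooled output unchanged:", pooled_same)
print("score matrix entries:", len(weights) * len(weights[0]))
print("raw / scaled score variance:", raw_var, scaled_var)
```

The score matrix has n × n entries, so doubling the sequence length quadruples both the number of dot products and the memory needed to hold the attention weights. In a full Transformer with learned projections the same permutation property holds as long as the projections are shared across positions, which is the standard setup.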