Questions: Transformer Theory and Attention Mechanisms
4 questions to test your understanding
Question 1 Multiple Choice
In self-attention, the attention weights are computed as softmax(Q @ K^T / sqrt(d)). What is the purpose of dividing by sqrt(d)?
A. To normalize gradients and stabilize training
B. To scale attention scores so that softmax gradients don't vanish when key dimension d is large
C. To reduce memory usage during attention computation
D. The scaling factor is arbitrary and has no effect on performance
Answer: B
If the components of a query and a key are independent with zero mean and unit variance, their dot product has zero mean and variance d, so the entries of Q @ K^T have a typical magnitude of about sqrt(d), where d is the key dimension. When d is large, the scores become large in magnitude and softmax saturates: nearly all of the weight lands on one position and the gradients with respect to the scores become vanishingly small. Dividing by sqrt(d) rescales the scores to roughly unit variance, which keeps softmax in a range where gradients are non-negligible. Without this scaling, the attention layers would pass back very little gradient signal and training would tend to be slow or unstable, especially in deep transformers.
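The effect described above can be checked numerically. The following is a minimal sketch, assuming query and key components drawn as independent standard normals; the helper names `softmax` and `dot_product_std` and the choice d = 256 are illustrative, not part of the quiz.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_std(d, trials=1000, seed=0):
    """Empirical standard deviation of q.k for random unit-variance q and k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return math.sqrt(sum((x - mean) ** 2 for x in dots) / trials)

d = 256  # assumed key dimension for this illustration
raw_std = dot_product_std(d)          # should land near sqrt(d) = 16
scaled_std = raw_std / math.sqrt(d)   # should land near 1

# Compare softmax on unit-scale scores with the same scores blown up by sqrt(d).
rng = random.Random(1)
unit_scores = [rng.gauss(0.0, 1.0) for _ in range(8)]
raw_scores = [s * math.sqrt(d) for s in unit_scores]

print("std of q.k:", raw_std, "after scaling:", scaled_std)
print("largest weight, unscaled:", max(softmax(raw_scores)))
print("largest weight, scaled:  ", max(softmax(unit_scores)))
```

On the unscaled scores the largest softmax weight sits much closer to 1 than on the scaled ones, which is the saturation the explanation refers to.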
Question 2 Short Answer
Transformers replaced RNNs by using self-attention. What is the key advantage of self-attention over RNNs for sequential processing?
Think about your answer, then reveal below.
Model answer: Self-attention computes relationships between all positions in parallel, which makes training on sequences far easier to parallelize and gives better gradient flow during backpropagation. RNNs process a sequence one timestep at a time, so they are hard to parallelize, and gradients must pass through every intermediate step, where they tend to decay or explode over long sequences (the vanishing/exploding gradient problem). Self-attention compares any two positions directly, regardless of distance and without intermediate sequential steps. This helps it capture long-range dependencies, train with better gradient properties, and make full use of GPU parallelism. These properties are a major reason transformers have been scaled to billions of parameters, while RNNs proved much harder to scale.
The parallelizability and gradient flow properties of self-attention underpin much of the recent progress in deep learning. Sequences long enough to make an RNN lose gradient signal are handled far more gracefully by transformers, which has enabled both larger models and longer contexts.
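The contrast above can be made concrete with a toy sketch. It assumes a scalar linear recurrence h_t = w * h_{t-1} + x_t as a stand-in for an RNN (real RNNs use matrices and non-linearities) and a single attention head over scalar inputs with identity projections; both function names are hypothetical.

```python
import math

def rnn_gradient_to_first_input(w, steps):
    """For h_t = w * h_{t-1} + x_t, the gradient of h_steps with respect to x_0
    is w ** steps: one factor of w per timestep the signal passes through."""
    grad = 1.0
    for _ in range(steps):  # inherently sequential: each step needs the previous one
        grad *= w
    return grad

def attention_output(i, xs):
    """Self-attention output for position i over scalar inputs xs, with q = k = v = x.
    It reads every input directly and does not depend on any other position's output."""
    scores = [xs[i] * xj for xj in xs]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return sum(wt / z * xj for wt, xj in zip(weights, xs))

# Gradient reaching the first input after 100 recurrent steps with w = 0.9.
decayed = rnn_gradient_to_first_input(0.9, 100)
print("RNN gradient after 100 steps:", decayed)

# Attention outputs can be computed in any order (or all at once) with the same result.
xs = [0.5, -1.0, 2.0, 0.1]
forward = [attention_output(i, xs) for i in range(len(xs))]
backward = [attention_output(i, xs) for i in reversed(range(len(xs)))]
print(forward == backward[::-1])
```

In the recurrence, the signal from the first input is multiplied by w once per step, so it shrinks geometrically when |w| < 1 and grows when |w| > 1. In the attention function, each output is a direct function of all inputs, so the positions have no ordering dependency between them.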
Question 3 Multiple Choice
A transformer with multiple attention heads computes attention separately for each head, then concatenates the results. Why use multiple heads instead of one large attention mechanism?
A. Multiple heads have no advantage; they are purely for computational efficiency
B. Multiple heads allow the model to attend to different types of relationships in different heads; some heads may focus on syntax, others on semantics
C. Multiple heads reduce overfitting by regularizing attention
D. Multiple heads increase model capacity without changing parameter count
Answer: B
Multiple attention heads provide representational diversity. Different heads can learn different attention patterns: some might focus on nearby tokens (local syntax), others on distant related tokens (semantic relationships), and others on special tokens (like delimiters). This decomposition allows the model to learn multiple types of relationships in a single layer, improving expressiveness. Empirically, some heads exhibit interpretable patterns (e.g., one head attending to determiners, another to noun-adjective pairs), suggesting that multi-head attention learns a rich, structured representation.
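The "compute per head, then concatenate" structure can be sketched as follows. This is a minimal illustration in plain Python, assuming random matrices in place of learned projection weights and a head width of d_model / num_heads; the helper names (`matmul`, `random_matrix`, `multi_head_attention`) are hypothetical and do not come from any library.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix (assumption for this sketch)."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(rows) for _ in range(cols)]
            for _ in range(rows)]

def multi_head_attention(x, num_heads, rng):
    """x is a seq_len x d_model matrix. Returns (output, list of per-head weights)."""
    d_model = len(x[0])
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    head_outputs, head_weights = [], []
    for _ in range(num_heads):
        # Each head has its own query, key, and value projections.
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
        scores = matmul(q, transpose(k))
        scores = [[s / math.sqrt(d_head) for s in row] for row in scores]
        weights = softmax_rows(scores)  # seq_len x seq_len pattern for this head
        head_outputs.append(matmul(weights, v))
        head_weights.append(weights)
    # Concatenate the heads along the feature dimension, then mix with an output projection.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(x))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o), head_weights

rng = random.Random(0)
x = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]  # 4 tokens, d_model = 8
out, weights = multi_head_attention(x, num_heads=2, rng=rng)
print(len(out), len(out[0]), len(weights))
```

Because each head has separate projections, each produces its own seq_len x seq_len weight matrix, which is what lets different heads specialize in different relationships once the projections are trained.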
Question 4 True / False
True or false: the core transformer architecture (self-attention + feed-forward layers + layer norm + residuals) has remained largely unchanged since 2017, and this simple design has proven effective across diverse domains.
T. True
F. False
Answer: True
The transformer architecture, despite its simplicity, has proven remarkably general. Self-attention has theoretical grounding (universal approximation results, permutation equivariance), feed-forward layers provide non-linearity, and residual connections enable deep stacking. The architecture has much weaker domain-specific inductive biases than CNNs (built around locality in images) or RNNs (built around sequential order), which makes it general-purpose. When scaled with large models and large data (following neural scaling laws), transformers achieve state-of-the-art results in language, vision, and multimodal tasks. The lack of change reflects the architecture's fundamental soundness rather than a lack of innovation: improvements have focused mainly on training techniques, scaling, and data rather than on architectural changes.