Question 1 Multiple Choice
In scaled dot-product attention, why is the dot product divided by √d_k before applying softmax?
A. To normalize output values so they fall between 0 and 1
B. To ensure queries and keys are comparable even when they have different vector magnitudes
C. To prevent dot products from growing large in high dimensions, which would push softmax into low-gradient regions and stall learning
D. To ensure attention weights sum to d_k rather than 1, preserving information content
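Answer: C
If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k, so its typical magnitude grows in proportion to √d_k. Feeding such large values into softmax pushes it toward saturation, where the output is nearly one-hot and the gradients are nearly zero, which makes learning extremely slow. Dividing by √d_k brings the variance of the scores back to about 1 and keeps the softmax inputs in a range with healthy gradients. Option A is wrong: softmax outputs already lie between 0 and 1 and sum to 1 regardless of scaling. Option B is wrong: the scaling controls the variance of the scores, not the relative magnitudes of individual vectors. Option D is wrong: attention weights always sum to 1, never to d_k.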
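The variance argument above can be checked numerically. The following is a minimal NumPy sketch, assuming queries and keys drawn from a standard normal distribution; the helper names `softmax` and `dot_product_std`, the sample size, and the chosen dimensions are illustrative assumptions rather than part of any library.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dot_product_std(d_k, n_samples=5000, seed=0):
    """Estimate the std of q.k before and after dividing by sqrt(d_k).

    Assumes (for illustration) i.i.d. standard normal components.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)
    return dots.std(), (dots / np.sqrt(d_k)).std()

for d_k in (4, 64, 512):
    raw, scaled = dot_product_std(d_k)
    # The raw std should come out near sqrt(d_k), the scaled std near 1.
    print(f"d_k={d_k:4d}  raw std={raw:6.2f}  scaled std={scaled:4.2f}")

# Effect on softmax: one query scored against 10 random keys at d_k = 512.
rng = np.random.default_rng(1)
d_k = 512
query = rng.standard_normal(d_k)
keys = rng.standard_normal((10, d_k))
raw_scores = keys @ query
raw_weights = softmax(raw_scores)                     # tends toward one-hot
scaled_weights = softmax(raw_scores / np.sqrt(d_k))   # noticeably flatter
print("largest weight, unscaled:", raw_weights.max())
print("largest weight, scaled:  ", scaled_weights.max())
```

Shrinking the scores by a factor below 1 can never increase the largest softmax weight, so the scaled distribution is always at least as flat as the unscaled one.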
Question 2 Multiple Choice
A transformer model processes 'The trophy didn't fit in the suitcase because it was too big.' To resolve what 'it' refers to, which description best captures what attention does?
A. The model identifies 'it' as the most recently mentioned noun using a fixed positional rule
B. The query vector for 'it' produces high similarity scores with 'trophy' because their learned key-query projections are compatible, so the trophy's value vector dominates the output for 'it'
C. Multi-head attention averages all noun representations with equal weights
D. The model resolves the ambiguity using a rule-based dependency parser that runs before attention
Answer: B
In a trained transformer, the query vector for 'it' is compared by dot product with the key vectors of 'trophy' and 'suitcase', and the higher attention weight goes to whichever noun is semantically compatible with 'too big.' Here that is the trophy: with the suitcase as antecedent the sentence would need 'too small' instead. The learned projections capture this semantic relationship. Option A is wrong because no fixed positional rule is involved, option C is wrong because the weights are learned and input-dependent rather than equal, and option D is wrong because no external parser is needed. Multi-head attention also allows different heads to specialize in different relationship types.
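The mechanics for a single query can be sketched with a toy example. The vectors below are hand-picked for illustration and are not taken from any trained model; the names `tokens`, `keys`, `values`, and `query_it` are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical 2-D key and value vectors (made up, NOT learned weights).
tokens = ["trophy", "suitcase", "it"]
keys = np.array([[2.0, 0.0],    # trophy
                 [0.0, 2.0],    # suitcase
                 [0.5, 0.5]])   # it
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.5, 0.5]])

# Hypothetical query for 'it', chosen to align with the key of 'trophy'.
query_it = np.array([2.0, 0.5])

d_k = keys.shape[1]
scores = keys @ query_it / np.sqrt(d_k)   # one similarity score per token
weights = softmax(scores)                 # attention weights, sum to 1
output_it = weights @ values              # weighted mix of value vectors

for token, weight in zip(tokens, weights):
    print(f"{token:9s} weight={weight:.3f}")
print("output for 'it':", output_it)
```

Because the query was constructed to align with the first key, 'trophy' receives the largest weight and its value vector dominates the output. In a real model that alignment would come from the learned projections and the surrounding context.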
Question 3 True / False
Attention mechanisms allow every position in a sequence to directly attend to every other position simultaneously, unlike recurrent networks which pass information step-by-step.
T. True
F. False
Answer: True
This is the core architectural advantage of attention. In an RNN, position 10 can only 'see' earlier positions through hidden states passed along sequentially, so information from distant positions gets diluted over many steps. In attention, the query at position 10 computes similarity scores against the keys from ALL positions at once and forms a weighted combination of their values. There is no sequential bottleneck, and every pair of positions is directly connected regardless of distance. Because no position has to wait for another, the whole computation can also be parallelized on a GPU.
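The contrast can be made concrete with a small NumPy sketch. It assumes self-attention with identity Q/K/V projections (a simplification for brevity) and uses a generic tanh recurrence as a stand-in for an RNN; the function name `attention` and all sizes are illustrative.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for all positions in one shot."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n, n) pairwise scores
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 10, 8                      # sequence length and feature size (arbitrary)
X = rng.standard_normal((n, d))

# Attention: a single matrix product links every position to every other one.
out, weights = attention(X, X, X)
print(weights.shape)              # one weight per (query position, key position) pair

# Recurrence for contrast: n dependent steps, each waiting on the previous one.
W = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
for t in range(n):
    h = np.tanh(W @ h + X[t])     # position t reaches earlier inputs only through h
```

Each row of `weights` is a distribution over all positions, so the last position has a direct, non-zero connection to the first one, while the recurrent state has to carry that information through every intermediate step.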
Question 4 True / False
In multi-head attention with h heads, each head operates on the full d_model-dimensional representation, making it strictly more computationally expensive than single-head attention.
T. True
F. False
Answer: False
With h heads, each head operates on projections of dimension d_k = d_model/h, not on the full d_model. Each head attends in a lower-dimensional subspace, and the head outputs are concatenated back to the full model dimension before a final linear projection. The design was deliberately made computationally comparable to single-head attention at full dimensionality, while gaining the ability to capture multiple relationship types in parallel: different heads can specialize in syntactic, semantic, or positional relationships simultaneously.
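A quick bit of arithmetic shows why the cost stays comparable. The sketch below assumes d_model = 512 and h = 8 purely for illustration, and ignores the final output projection; all variable names are hypothetical.

```python
import numpy as np

d_model, h = 512, 8               # illustrative sizes (assumed)
d_k = d_model // h                # per-head dimension

# Q, K, V projection parameters: one full-width head vs. h narrow heads.
single_head_params = 3 * d_model * d_model
multi_head_params = 3 * h * d_model * d_k
print(single_head_params, multi_head_params)

# Multiply-adds for the score matrices at sequence length n.
n = 5
single_head_score_ops = n * n * d_model
multi_head_score_ops = h * n * n * d_k
print(single_head_score_ops, multi_head_score_ops)

# Splitting into heads is only a reshape of the projected matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d_model))
W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
Q = X @ W_q                                             # (n, d_model)
Q_heads = Q.reshape(n, h, d_k).transpose(1, 0, 2)       # (h, n, d_k)
merged = Q_heads.transpose(1, 0, 2).reshape(n, d_model) # concatenate heads back
print(Q_heads.shape, merged.shape)
```

Since h × d_k = d_model, both the parameter count and the score computation match the single-head case. The heads partition the same width rather than multiplying it.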
Question 5 Short Answer
Explain why attention is described as a 'soft' lookup table, and what property of softmax makes the softness possible.
Model answer: A hard lookup table returns the value for the single exact matching key. Attention is 'soft' because the query is compared to every key, softmax converts the similarity scores into a probability distribution (weights summing to 1), and the output is a weighted combination of ALL values — not just the best match. Every value contributes to the output, with contributions proportional to query-key similarity. The softness comes from softmax producing non-zero weights for every key, which also makes the operation differentiable everywhere — gradients flow to all keys proportional to their current relevance, enabling end-to-end learning of Q/K/V projections.
Differentiability is crucial. A hard argmax would select a single key, and because its output does not change under small changes to the scores, it would pass back zero gradient to the query and key projections, making it impossible to learn which keys are relevant. Softmax keeps all gradients active, which is why attention can be trained effectively.
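The difference between the soft and the hard lookup can be seen with a small finite-difference check. This is a sketch with made-up keys, values, and query; the helper names `soft_lookup`, `hard_lookup`, and `finite_diff_grad` are hypothetical, and the scores are left unscaled for simplicity.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Made-up table of three keys with scalar values, and one query.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([10.0, 20.0, 30.0])
query = np.array([1.0, 0.2])
scores = keys @ query                      # similarity of the query to each key

def soft_lookup(s):
    """Attention-style lookup: weighted combination of ALL values."""
    return softmax(s) @ values

def hard_lookup(s):
    """Dictionary-style lookup: value of the single best-matching key."""
    return values[np.argmax(s)]

def finite_diff_grad(f, s, eps=1e-4):
    """Approximate d f / d s_i by nudging each score in turn."""
    grad = np.zeros_like(s)
    for i in range(len(s)):
        bumped = s.copy()
        bumped[i] += eps
        grad[i] = (f(bumped) - f(s)) / eps
    return grad

soft_grad = finite_diff_grad(soft_lookup, scores)
hard_grad = finite_diff_grad(hard_lookup, scores)
print("soft output:", soft_lookup(scores), "gradient:", soft_grad)
print("hard output:", hard_lookup(scores), "gradient:", hard_grad)
```

The hard lookup returns exactly one stored value and does not respond to small changes in any score, whereas the soft lookup blends all three values and gives every score a non-zero gradient.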