Questions: Attention Mechanisms

5 questions to test your understanding

Question 1 Multiple Choice

In scaled dot-product attention, why is the dot product divided by √d_k before applying softmax?

A. To normalize output values so they fall between 0 and 1
B. To ensure queries and keys are comparable even when they have different vector magnitudes
C. To prevent dot products from growing large in high dimensions, which would push softmax into low-gradient regions and stall learning
D. To ensure attention weights sum to d_k rather than 1, preserving information content
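
For readers who want to experiment with the operation this question refers to, here is a minimal NumPy sketch of scaled dot-product attention. It is an illustration under stated assumptions, not a reference implementation: the function names, the `scale` flag, the random inputs, and the choice of d_k = 512 are all hypothetical and not taken from the quiz.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, scale=True):
    """q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v). Returns (output, weights)."""
    d_k = q.shape[-1]
    scores = q @ k.T                      # raw query-key dot products, shape (n_q, n_k)
    if scale:
        scores = scores / np.sqrt(d_k)    # the division the question asks about
    weights = softmax(scores, axis=-1)    # each row of weights sums to 1
    return weights @ v, weights

# Hypothetical random inputs: 4 queries, 6 keys/values, d_k = 512, d_v = 8.
rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal((4, d_k))
k = rng.standard_normal((6, d_k))
v = rng.standard_normal((6, 8))

out_scaled, w_scaled = scaled_dot_product_attention(q, k, v, scale=True)
out_raw, w_raw = scaled_dot_product_attention(q, k, v, scale=False)

print("largest weight per query (scaled):  ", w_scaled.max(axis=-1))
print("largest weight per query (unscaled):", w_raw.max(axis=-1))
```

Rerunning the sketch with different values of `d_k` and comparing the two printed rows is one way to check your answer before moving on.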
Question 2 Multiple Choice

A transformer model processes 'The trophy didn't fit in the suitcase because it was too big.' To resolve what 'it' refers to, which description best captures what attention does?

A. The model identifies 'it' as the most recently mentioned noun using a fixed positional rule
B. The query vector for 'it' produces high similarity scores with 'trophy' because their learned key-query projections are compatible, so the trophy's value vector dominates the output for 'it'
C. Multi-head attention averages all noun representations with equal weights
D. The model resolves the ambiguity using a rule-based dependency parser that runs before attention
Question 3 True / False

Attention mechanisms allow every position in a sequence to attend directly to every other position simultaneously, unlike recurrent networks, which pass information step by step.

T. True
F. False
Question 4 True / False

In multi-head attention with h heads, each head operates on the full d_k-dimensional representation, making it strictly more computationally expensive than single-head attention.

T. True
F. False
Question 5 Short Answer

Explain why attention is described as a 'soft' lookup table, and what property of softmax makes the softness possible.
