Questions — Attention Mechanisms — Open Knowledge Graph

Question 1 Multiple Choice

In scaled dot-product attention, why is the dot product divided by √d_k before applying softmax?

A. To normalize output values so they fall between 0 and 1

B. To ensure queries and keys are comparable even when they have different vector magnitudes

C. To prevent dot products from growing large in high dimensions, which would push softmax into low-gradient regions and stall learning

D. To ensure attention weights sum to d_k rather than 1, preserving information content
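
For reference, the operation this question asks about can be written out in a few lines. The following is a minimal sketch in plain Python, not a production implementation: the function names and the toy vectors are made up for illustration, and it handles a single unbatched, unmasked attention call only. It may help to vary the vectors, or to remove the division by √d_k, and watch how the printed weights change.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Scaled dot-product attention over plain Python lists.

    queries: n_q vectors of length d_k
    keys:    n_k vectors of length d_k
    values:  n_k vectors of length d_v
    Returns (outputs, weights), one row per query.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # Dot product of this query with every key, divided by sqrt(d_k)
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Output is the weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy inputs (hypothetical): 2 queries, 3 key/value pairs, d_k = 4, d_v = 2
queries = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 2.0]]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = scaled_dot_product_attention(queries, keys, values)
for q_index, row in enumerate(weights):
    print(q_index, [round(w, 3) for w in row])
```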

Question 2 Multiple Choice

A transformer model processes 'The trophy didn't fit in the suitcase because it was too big.' To resolve what 'it' refers to, which description best captures what attention does?

A. The model identifies 'it' as the most recently mentioned noun using a fixed positional rule

B. The query vector for 'it' produces high similarity scores with 'trophy' because their learned key-query projections are compatible, so the trophy's value vector dominates the output for 'it'

C. Multi-head attention averages all noun representations with equal weights

D. The model resolves the ambiguity using a rule-based dependency parser that runs before attention

Question 3 True / False

Attention mechanisms allow every position in a sequence to attend directly to every other position simultaneously, unlike recurrent networks, which pass information step by step.

T. True

F. False

Question 4 True / False

In multi-head attention with h heads, each head operates on the full d_k-dimensional representation, making multi-head attention strictly more computationally expensive than single-head attention.

T. True

F. False

Question 5 Short Answer

Explain why attention is described as a 'soft' lookup table, and what property of softmax makes the softness possible.

