Questions: Transformer Theory and Attention Mechanisms

4 questions to test your understanding

Question 1 Multiple Choice

In self-attention, the attention weights are computed as softmax(Q @ K^T / sqrt(d)). What is the purpose of dividing by sqrt(d)?

A. To normalize gradients and stabilize training
B. To scale attention scores so that softmax gradients don't vanish when the key dimension d is large
C. To reduce memory usage during attention computation
D. The scaling factor is arbitrary and has no effect on performance
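For reference, the formula in this question can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the helper names `softmax` and `attention_weights` and the toy matrices `Q` and `K` are made up for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """Compute softmax(Q @ K^T / sqrt(d)) for lists of row vectors.

    Q has one row per query, K has one row per key, and d is the
    key dimension (the length of each key vector).
    """
    d = len(K[0])
    scale = math.sqrt(d)
    weights = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    return weights

# Hypothetical toy inputs: 2 queries, 3 keys, d = 4.
Q = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
K = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
W = attention_weights(Q, K)  # 2 x 3 matrix; each row sums to 1
```

Each row of `W` is a probability distribution over the keys for one query.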
Question 2 Short Answer

Transformers replaced RNNs by using self-attention. What is the key advantage of self-attention over RNNs for sequential processing?

Question 3 Multiple Choice

A transformer with multiple attention heads computes attention separately for each head, then concatenates the results. Why use multiple heads instead of one large attention mechanism?

A. Multiple heads have no advantage; they are purely for computational efficiency
B. Multiple heads allow the model to attend to different types of relationships in different heads; some heads may focus on syntax, others on semantics
C. Multiple heads reduce overfitting by regularizing attention
D. Multiple heads increase model capacity without changing parameter count
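The mechanics described in this question (attention computed separately per head, then concatenated) can be sketched as follows. This is a simplified illustration with loud assumptions: each head is formed by slicing the feature dimension into equal chunks, standing in for the learned per-head query, key, and value projections and the final output projection that a real transformer layer uses. The function names and the toy input `X` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(Q, K, V):
    """Single-head attention: softmax(Q @ K^T / sqrt(d)) @ V."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out

def split_heads(X, num_heads):
    """Slice each row into num_heads equal chunks.

    Assumption: plain slicing replaces the learned projections
    of a real multi-head attention layer.
    """
    if len(X[0]) % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = len(X[0]) // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in X]
            for h in range(num_heads)]

def multi_head_self_attention(X, num_heads):
    """Run attention independently per head, then concatenate the results."""
    heads = split_heads(X, num_heads)
    per_head = [attend(Xh, Xh, Xh) for Xh in heads]
    # Concatenate head outputs along the feature dimension, row by row.
    return [sum((per_head[h][i] for h in range(num_heads)), [])
            for i in range(len(X))]

# Hypothetical toy input: 3 tokens, 4 features, split into 2 heads of size 2.
X = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
Y = multi_head_self_attention(X, num_heads=2)  # same shape as X: 3 x 4
```

Because each head sees only its own slice, the attention weights of one head are computed without reference to the other head's features.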
Question 4 True / False

True or false: the Transformer architecture has remained largely unchanged since 2017 (self-attention + feed-forward layers + layer norm + residuals), and this simple design has proven effective across diverse domains.

True
False