Questions: Transformer Theory and Attention Mechanisms

4 questions to test your understanding

Question 1 Multiple Choice

In self-attention, the attention weights are computed as softmax(Q @ K^T / sqrt(d)). What is the purpose of dividing by sqrt(d)?

A. To normalize gradients and stabilize training

B. To scale attention scores so that softmax gradients don't vanish when the key dimension d is large

C. To reduce memory usage during attention computation

D. The scaling factor is arbitrary and has no effect on performance
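
For reference, the formula in this question can be written out as a minimal sketch in plain Python. The helper names (softmax, attention_weights) and the toy Q and K values are assumptions made for illustration, not part of the quiz; the sketch only shows how the weights are computed and takes no position on the answer.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """Compute softmax(Q @ K^T / sqrt(d)) one query row at a time.

    Q and K are lists of vectors (lists of floats) sharing dimension d.
    """
    d = len(K[0])
    scale = math.sqrt(d)
    weights = []
    for q in Q:
        # Dot product of the query with every key, divided by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    return weights

# Toy inputs (hypothetical): 2 queries, 3 keys, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = attention_weights(Q, K)
```

Each row of W holds one query's weights over all keys, and each row sums to 1.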

Question 2 Short Answer

Transformers replaced RNNs by using self-attention. What is the key advantage of self-attention over RNNs for sequential processing?

Question 3 Multiple Choice

A transformer with multiple attention heads computes attention separately for each head, then concatenates the results. Why use multiple heads instead of one large attention mechanism?

A. Multiple heads have no advantage; they are purely for computational efficiency

B. Multiple heads allow the model to attend to different types of relationships in different heads; some heads may focus on syntax, others on semantics

C. Multiple heads reduce overfitting by regularizing attention

D. Multiple heads increase model capacity without changing parameter count
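
The mechanics described in this question (attention computed separately per head, then concatenated) can be sketched in plain Python. This is a simplified illustration under stated assumptions: real multi-head attention applies learned per-head projections and an output projection, whereas this sketch just slices each input vector into equal chunks and uses the same chunk as query, key, and value. The names single_head and multi_head and the toy input X are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def single_head(Q, K, V):
    """Scaled dot-product attention for one head."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out

def multi_head(X, num_heads):
    """Run attention per head on slices of X, then concatenate.

    Assumption: no learned projections; each head sees a slice of the features.
    """
    d_model = len(X[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        Xh = [row[lo:hi] for row in X]
        head_outputs.append(single_head(Xh, Xh, Xh))
    # Concatenate the heads along the feature dimension for each position.
    return [sum((head[i] for head in head_outputs), []) for i in range(len(X))]

# Toy input (hypothetical): 3 positions, d_model = 4, split across 2 heads.
X = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
Y = multi_head(X, num_heads=2)
```

The output Y has the same shape as X: one vector of width d_model per position, with each head contributing its own d_head-wide slice.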

Question 4 True / False

True or false: the Transformer architecture has remained largely unchanged since 2017 (self-attention + feed-forward layers + layer norm + residuals). As you answer, consider why this simple architecture has proven so effective across diverse domains.

T. True

F. False