A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Self-Attention and Multi-Head Attention

Research · Depth 94 in the knowledge graph

640 prerequisites beneath it

Core Idea

Self-attention lets every position in a sequence attend to every other position: the output at each position is a weighted sum over the representations of all positions. Multi-head attention runs several self-attention operations in parallel, each free to learn a different attention pattern. This mechanism is central to Transformers and lets them model long-range dependencies more effectively than RNNs.

Explainer

You already understand attention as a mechanism that lets a model focus on relevant parts of an input when producing an output. Self-attention applies this idea within a single sequence — every position attends to every other position in the same sequence, computing how relevant each word (or token) is to every other word. In the sentence "The cat sat on the mat because it was tired," self-attention at the position of "it" can learn to attend strongly to "cat," resolving the pronoun reference. No recurrence or convolution is needed — every pair of positions interacts directly regardless of distance.

The mechanism works through three learned projections. Each input position is projected into a query vector (what am I looking for?), a key vector (what do I contain?), and a value vector (what information do I carry?). Attention scores are computed as the dot product of each query with every key, divided by √dₖ, where dₖ is the dimension of the key vectors. Without this scaling, the dot products grow in magnitude as dₖ grows, which pushes the softmax toward a near one-hot distribution with very small gradients. After softmax, these scores become weights that determine how much each position's value vector contributes to the output at the query position. The entire operation can be written as Attention(Q, K, V) = softmax(QK^T / √dₖ)V. Because it is expressed as matrix multiplications, it parallelizes well on GPUs, a critical advantage over the sequential processing that RNNs require.
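The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the helper names `softmax` and `attention`, the toy matrices, and the choice of plain lists instead of a tensor library are all assumptions made for readability. It also compares a softmax over raw scores with one over scores divided by √dₖ, using made-up scores for dₖ = 64.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q holds n query rows, K holds m key rows (both of length d_k),
    V holds m value rows. Returns (outputs, weights).
    """
    scale = math.sqrt(len(K[0]))
    d_v = len(V[0])
    weights = []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    # Each output row is the weighted sum of the value rows.
    outputs = [[sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
               for row in weights]
    return outputs, weights

# Toy inputs (illustrative numbers only): 2 queries, 3 keys/values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, w = attention(Q, K, V)

# Effect of the scaling: hypothetical raw scores for d_k = 64.
raw_scores = [8.0, 0.0, -8.0]
unscaled_peak = max(softmax(raw_scores))
scaled_peak = max(softmax([s / math.sqrt(64) for s in raw_scores]))
```

In this toy setup the unscaled softmax puts almost all of its weight on one key, while the scaled version spreads the weight more evenly, which is the saturation effect the √dₖ factor is meant to reduce.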

A single attention head learns one pattern of relevance, such as syntactic dependency, coreference, or positional proximity. But language requires attending to multiple relationships simultaneously. Multi-head attention addresses this by running h separate attention operations in parallel, each with its own learned Q, K, V projections into a smaller subspace. In the standard setup each head has dimension d_model/h, where d_model is the model dimension, so dₖ = d_model/h. The outputs of all heads are concatenated and linearly projected back to the model dimension. In practice, different heads are often observed to specialize: one might track subject-verb agreement across long distances while another focuses on adjacent-word relationships. This division of labor emerges from training, without explicit supervision.
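The split-attend-concatenate-project sequence can be sketched as follows. This is an illustrative sketch under stated assumptions: each head works in a subspace of size d_model / h (the standard Transformer choice), the per-head projections are taken as column slices of full-width matrices, and all names (`matmul`, `multi_head_attention`, `random_matrix`, `W_q`, `W_k`, `W_v`, `W_o`) and the random weights are hypothetical stand-ins for learned parameters.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head."""
    scale = math.sqrt(len(K[0]))
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Run h heads on slices of the projections, then concatenate and project."""
    d_model = len(X[0])
    assert d_model % h == 0, "d_model must be divisible by the number of heads"
    d_head = d_model // h
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    heads = []
    for i in range(h):
        lo, hi = i * d_head, (i + 1) * d_head
        # Column slice lo:hi is this head's subspace.
        Qi = [row[lo:hi] for row in Q]
        Ki = [row[lo:hi] for row in K]
        Vi = [row[lo:hi] for row in V]
        heads.append(attention(Qi, Ki, Vi))
    # Concatenate the head outputs position by position.
    concat = [sum((head[t] for head in heads), []) for t in range(len(X))]
    return matmul(concat, W_o)

def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model, h = 5, 8, 2
X = random_matrix(n, d_model, rng)
W_q, W_k, W_v, W_o = (random_matrix(d_model, d_model, rng) for _ in range(4))
Y = multi_head_attention(X, W_q, W_k, W_v, W_o, h)  # n rows of length d_model
```

Slicing one full-width projection per role is equivalent to giving each head its own smaller projection matrix, and it is a common way to implement the idea; a real model would learn these weights rather than sample them.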

Self-attention has a key limitation: on its own it is permutation-equivariant. Shuffling the input tokens simply shuffles the outputs in the same way, so the mechanism itself has no notion of word order. This is why Transformers add positional encodings to the input embeddings, injecting sequence-order information that the attention mechanism can then use. The computational cost is O(n²) in sequence length, in both time and memory, since every position attends to every other; this becomes expensive for very long sequences. Despite this quadratic cost, self-attention can directly model relationships between any two positions, without information having to propagate step by step through intermediate states. That direct access is what makes Transformers so effective at capturing the long-range dependencies that recurrent models struggle with.
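The order-blindness described above can be checked numerically. The sketch below is an illustration under simplifying assumptions: it uses identity projections (Q = K = V = X) to stay short, one-hot toy embeddings, and the sinusoidal positional encoding as one common choice. The helper names are hypothetical. Without positional encodings, reordering the input only reorders the output rows; with them, the same token produces a different output at a different position.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Self-attention with identity projections (Q = K = V = X), for brevity."""
    scale = math.sqrt(len(X[0]))
    out = []
    for q in X:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in X])
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(len(q))])
    return out

def positional_encoding(n, d):
    """Sinusoidal encodings: sin on even dimensions, cos on odd ones."""
    pe = []
    for pos in range(n):
        row = []
        for j in range(d):
            angle = pos / (10000 ** ((j - j % 2) / d))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Three toy token embeddings and a reordering of them.
X = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
perm = [2, 0, 1]
X_perm = [X[i] for i in perm]

# Without positions: permuted input gives identically permuted output.
out = self_attention(X)
out_perm = self_attention(X_perm)
diff_without_pe = max_abs_diff(out_perm, [out[i] for i in perm])

# With positions added after reordering, the outputs no longer match.
pe = positional_encoding(len(X), len(X[0]))
out_pe = self_attention(add(X, pe))
out_pe_perm = self_attention(add(X_perm, pe))
diff_with_pe = max_abs_diff(out_pe_perm, [out_pe[i] for i in perm])
```

Here `diff_without_pe` is zero up to floating-point rounding, while `diff_with_pe` is clearly nonzero, which is the sense in which positional encodings make word order visible to attention.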

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Linear Regression in Machine Learning → Neural Network Fundamentals → Attention Mechanisms → Transformer Architecture → Self-Attention and Multi-Head Attention

Longest path: 95 steps · 640 total prerequisite topics

Prerequisites (2)

Attention Mechanisms (hard) · Transformer Architecture (hard)

Leads To (0)

No topics depend on this one yet.