A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Transformer Architecture

Research Depth 93 in the knowledge graph

7 topics build on this

639 prerequisites beneath it


Core Idea

Transformers replace RNNs with self-attention and feedforward layers, enabling parallel sequence processing. Positional encodings inject order information. Encoder-decoder structure processes inputs and generates outputs autoregressively without recurrence.

Explainer

Recurrent networks process sequences one token at a time, maintaining a hidden state that carries information forward. This sequential nature creates two problems: it prevents parallelization (each step waits for the previous one), and information from early tokens must survive through many compression steps to reach the end — a bottleneck that attention mechanisms only partially fix. The transformer architecture eliminates recurrence entirely. Every token attends to every other token directly through self-attention, meaning that relationships between distant tokens are captured in a single operation rather than being passed through a chain of hidden states.

The core mechanism is scaled dot-product attention, which you know from your study of attention mechanisms. Each token is projected into three vectors, a query (Q), a key (K), and a value (V), using learned linear transformations (the matrix operations from your prerequisites). Attention scores are computed as the dot product of each query with all keys and divided by √dₖ (where dₖ is the key dimension) to keep the softmax from saturating; the softmax turns each row of scores into weights that sum to 1, and those weights are used to average the values. In self-attention, the queries, keys, and values all come from the same sequence, so every token computes a weighted combination of the value vectors of all tokens in the sequence, itself included. This is done in parallel across all positions, with no sequential bottleneck. Multi-head attention runs several independent attention operations in parallel, each with its own Q/K/V projections, then concatenates the head outputs and applies a final linear projection. This allows the model to attend to different types of relationships simultaneously (one head might capture syntactic structure while another captures semantic similarity).
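The attention computation described above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not a production implementation: the helper names (`matmul`, `transpose`, `softmax`, `self_attention`), the toy input, and the use of identity matrices in place of learned projections are all assumptions made for the example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: Q, K, and V all come from the same sequence x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # (seq_len, seq_len) query-key dot products
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]  # each row sums to 1
    return matmul(weights, v), weights

# Toy input: 3 tokens with 2-dimensional embeddings (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # assumed stand-in for learned projections
output, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` says how much one token draws from every token in the sequence, itself included. A multi-head version would run this function several times with different projection matrices and concatenate the outputs.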

Since self-attention treats the input as an unordered set, the model needs explicit information about token order. Positional encodings — fixed sinusoidal functions or learned vectors — are added to the input embeddings to provide this. Each transformer layer then applies self-attention followed by a position-wise feedforward network (two linear transformations with a nonlinearity between them), with residual connections and layer normalization around each sub-layer. Stacking multiple such layers creates a deep network where each layer refines the representations produced by the layer below.
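As an illustration of the fixed sinusoidal option, the sketch below builds one encoding vector per position and adds it to the token embeddings. The function names are hypothetical, and the frequency base of 10000 follows the formulation in the original transformer paper; the text above does not specify it, so treat it as an assumption.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """Fixed sinusoidal positional encodings, one d_model-sized vector per position."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency; even gets sin, odd gets cos.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the encoding for each position to that position's embedding vector."""
    enc = sinusoidal_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pe)] for emb, pe in zip(embeddings, enc)]

# Two identical token embeddings become distinguishable once positions are added.
same_token = [0.5, 0.5, 0.5, 0.5]
with_positions = add_positions([same_token, same_token])
```

Without the added encodings, the two identical embeddings in this example would be indistinguishable to self-attention; with them, the same token at position 0 and position 1 yields different input vectors.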

The full transformer follows an encoder-decoder structure. The encoder processes the input through self-attention layers, producing contextualized representations. The decoder generates output tokens autoregressively: it uses masked self-attention (preventing positions from attending to future tokens, since those have not been generated yet) and cross-attention (queries from the decoder attend to keys and values from the encoder's output, the same role attention plays in seq2seq models). At inference time, the decoder generates one token at a time, appending each prediction to its own input for the next step. During training, the whole target sequence is already known, so the mask lets every position be computed in one pass. Because all attention operations are matrix multiplications over the full sequence, training is massively parallelizable on GPUs, the key practical advantage that enabled scaling to billions of parameters. Transformers now underpin virtually all state-of-the-art language models, from BERT (encoder-only) to GPT (decoder-only) to T5 (encoder-decoder).
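The two decoder-side ideas in this paragraph, masking out future positions and generating one token at a time, can be sketched as follows. Everything here is illustrative: `toy_scores` is a hypothetical stand-in for a trained decoder, greedy selection is just one possible decoding strategy, and the token ids are made up.

```python
import math

NEG_INF = float("-inf")

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores, giving zero weight to disallowed positions."""
    masked = [s if ok else NEG_INF for s, ok in zip(scores, allowed)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(next_token_scores, start_token, end_token, max_len):
    """Generate tokens one at a time, feeding each prediction back as input."""
    tokens = [start_token]
    while len(tokens) < max_len:
        scores = next_token_scores(tokens)  # hypothetical model call
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
        if next_token == end_token:
            break
    return tokens

def toy_scores(tokens, vocab_size=5):
    """Hypothetical stand-in for a trained decoder: prefers 'last token + 1'."""
    target = min(tokens[-1] + 1, vocab_size - 1)
    return [1.0 if t == target else 0.0 for t in range(vocab_size)]

# Position 1 of a 3-token sequence may attend to positions 0 and 1 only.
weights = masked_softmax([0.0, 0.0, 0.0], causal_mask(3)[1])
generated = greedy_decode(toy_scores, start_token=0, end_token=4, max_len=10)
```

Setting disallowed scores to negative infinity before the softmax gives those positions exactly zero weight, which is how the mask keeps a position from seeing tokens that come after it.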


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Linear Regression in Machine Learning → Neural Network Fundamentals → Attention Mechanisms → Transformer Architecture

Longest path: 94 steps · 639 total prerequisite topics

Prerequisites (5)

Attention Mechanisms (hard), Linear Transformations (soft), Matrix Operations (soft), Dot Product (Inner Product in R^n) (soft), Matrix Multiplication (soft)

Leads To (2)

Language Models and Neural Language Modeling (hard), Self-Attention and Multi-Head Attention (hard)