Transformer Architecture

Research · Depth 66 in the knowledge graph
Unlocks 7 downstream topics
deep-learning attention neural-architecture

Core Idea

Transformers replace the recurrence of RNNs with self-attention and feedforward layers, so all positions in a sequence can be processed in parallel. Positional encodings inject order information. In the original encoder-decoder design, the encoder processes the input and the decoder generates the output autoregressively, all without recurrence.

Explainer

Recurrent networks process sequences one token at a time, maintaining a hidden state that carries information forward. This sequential nature creates two problems: it prevents parallelization (each step waits for the previous one), and information from early tokens must survive through many compression steps to reach the end — a bottleneck that attention mechanisms only partially fix. The transformer architecture eliminates recurrence entirely. Every token attends to every other token directly through self-attention, meaning that relationships between distant tokens are captured in a single operation rather than being passed through a chain of hidden states.

The core mechanism is scaled dot-product attention, which you know from your study of attention mechanisms. Each token is projected into three vectors, a query (Q), a key (K), and a value (V), using learned linear transformations (the matrix operations from your prerequisites). Attention scores are computed as the dot product of each query with all keys and divided by √dₖ (the square root of the key dimension) so that large dot products do not saturate the softmax. The softmax turns each query's scores into weights that sum to 1, and those weights are used to average the values: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V. In self-attention, the queries, keys, and values all come from the same sequence, so every token's output is a weighted combination of the value vectors of all tokens in the sequence, including its own. This is done in parallel across all positions, with no sequential bottleneck. Multi-head attention runs several independent attention operations in parallel, each with its own Q/K/V projections, then concatenates the head outputs and applies a final linear projection. This lets the model attend to different types of relationships simultaneously (one head might capture syntactic structure while another captures semantic similarity).
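The attention computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the random weight matrices, and the toy sizes (4 tokens, model width 8, 2 heads) are all assumptions chosen for the example, and it assumes the model width divides evenly by the number of heads.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (..., seq_len, d_k); V: (..., seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)  # (..., seq_len, seq_len)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    return weights @ V, weights

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # X: (seq_len, d_model); every W_* here is (d_model, d_model).
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Self-attention: queries, keys, and values all come from the same X.
    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    heads, weights = scaled_dot_product_attention(Q, K, V)
    # Concatenate the heads back to (seq_len, d_model), then project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

# Toy inputs: random "embeddings" and random (untrained) projection matrices.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)      # (4, 8)
print(weights.shape)  # (2, 4, 4): one 4x4 attention matrix per head
```

Every position is handled by the same matrix products, which is why no loop over tokens appears anywhere in the sketch.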

Since self-attention treats the input as an unordered set, the model needs explicit information about token order. Positional encodings — fixed sinusoidal functions or learned vectors — are added to the input embeddings to provide this. Each transformer layer then applies self-attention followed by a position-wise feedforward network (two linear transformations with a nonlinearity between them), with residual connections and layer normalization around each sub-layer. Stacking multiple such layers creates a deep network where each layer refines the representations produced by the layer below.
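As a rough sketch of the pieces in this paragraph, the NumPy code below builds fixed sinusoidal positional encodings and a single-head encoder layer. Several choices are assumptions made for the example rather than anything stated above: the 10000 base for the sinusoid wavelengths, ReLU as the nonlinearity, the "post-norm" ordering LayerNorm(x + Sublayer(x)), the omission of learned layer-norm gain and bias, and all names and sizes.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Even dimensions use sine, odd dimensions use cosine, with wavelengths
    # in a geometric progression (assumes d_model is even).
    positions = np.arange(seq_len)[:, None]                # (seq_len, 1)
    div = 10000.0 ** (np.arange(0, d_model, 2) / d_model)  # (d_model / 2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (learned gain/bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, W_q, W_k, W_v):
    # Single-head scaled dot-product self-attention.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def feed_forward(x, W1, b1, W2, b2):
    # Two linear maps with a ReLU between them, applied to each position independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, p):
    # Residual connection, then layer normalization, around each sub-layer.
    x = layer_norm(x + self_attention(x, p["W_q"], p["W_k"], p["W_v"]))
    x = layer_norm(x + feed_forward(x, p["W1"], p["b1"], p["W2"], p["b2"]))
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32
params = {
    "W_q": rng.normal(size=(d_model, d_model)) * 0.1,
    "W_k": rng.normal(size=(d_model, d_model)) * 0.1,
    "W_v": rng.normal(size=(d_model, d_model)) * 0.1,
    "W1": rng.normal(size=(d_model, d_ff)) * 0.1,
    "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)) * 0.1,
    "b2": np.zeros(d_model),
}

embeddings = rng.normal(size=(seq_len, d_model))  # stand-in for token embeddings
x = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
for _ in range(2):
    # One parameter set is reused for brevity; a real stack has separate weights per layer.
    x = encoder_layer(x, params)
print(x.shape)  # (5, 8)
```

Because each layer maps a (seq_len, d_model) array to another array of the same shape, layers can be stacked to any depth.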

The full transformer follows an encoder-decoder structure. The encoder processes the input through self-attention layers, producing contextualized representations. The decoder generates output tokens autoregressively: it uses masked self-attention (preventing positions from attending to future tokens, since those have not been generated yet) and cross-attention, in which the queries come from the decoder and the keys and values come from the encoder's output, much like the attention in seq2seq models. At inference time, the decoder generates one token at a time, appending each prediction to the input for the next step. During training, the full target sequence is already known, so the mask lets every position be computed in one pass. Because all attention operations are matrix multiplications over the full sequence, training is massively parallelizable on GPUs, the key practical advantage that enabled scaling to billions of parameters. Transformers now underpin virtually all state-of-the-art language models, from BERT (encoder-only) to GPT (decoder-only) to T5 (encoder-decoder).
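The masking and the one-token-at-a-time generation loop can be illustrated with a short NumPy sketch. Everything named here is hypothetical: `toy_logits` is a stand-in for a trained decoder (it simply prefers the previous token id plus one), greedy argmax selection is just one possible decoding strategy, and the token ids and sizes are arbitrary.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may see positions 0..i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_self_attention(Q, K, V):
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Setting future positions to -inf gives them zero weight after the softmax.
    scores = np.where(causal_mask(seq_len), scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
_, weights = masked_self_attention(Q, K, V)
print(np.round(weights, 2))  # entries above the diagonal are zero

def greedy_decode(next_token_logits, bos_id, eos_id, max_len):
    # next_token_logits is a hypothetical callable: list of token ids -> logits.
    tokens = [bos_id]
    for _ in range(max_len):
        next_id = int(np.argmax(next_token_logits(tokens)))
        tokens.append(next_id)  # the prediction becomes input for the next step
        if next_id == eos_id:
            break
    return tokens

def toy_logits(tokens, vocab_size=5):
    # Stand-in for a trained decoder: always prefers (last token + 1).
    logits = np.zeros(vocab_size)
    logits[(tokens[-1] + 1) % vocab_size] = 1.0
    return logits

print(greedy_decode(toy_logits, bos_id=0, eos_id=4, max_len=10))  # [0, 1, 2, 3, 4]
```

The first half is what makes parallel training possible: with the mask in place, all positions of a known target sequence can be scored in one pass. The loop in the second half is inherently sequential, which is why generation is slower than training per token.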

Prerequisite Chain

Counting to 10 → Counting to 20 → Understanding Zero → The Number Zero → Counting to Five → One-to-One Correspondence → Combining Small Groups Within 5 → Addition Within 10 → Addition Within 20 → Two-Digit Addition Without Regrouping → Two-Digit Addition with Regrouping → Addition Within 100 → Repeated Addition as Multiplication → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Two-Digit by One-Digit Division → Division with Remainders → Remainders and Quotients in Division → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Reading and Writing Decimals → Comparing and Ordering Decimals → Adding and Subtracting Decimals → Multiplying Decimals → Dividing Decimals → Dividing Fractions → Mixed Number Arithmetic → Order of Operations → Integer Order of Operations → Variable Expressions → Combining Like Terms → One-Step Equations → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Literal Equations → Slope-Intercept Form → Point-Slope Form → Writing Linear Equations → Parallel and Perpendicular Line Slopes → Graphing Linear Equations → Piecewise Functions → Step Functions → Composition of Functions → Inverse Functions → Radical Functions and Graphs → Rational Exponents → Exponential Functions and Graphs → Geometric Sequences and Series → Sigma Notation → Expected Value → Linear Regression in Machine Learning → Neural Network Fundamentals → Attention Mechanisms → Transformer Architecture

Longest path: 67 steps · 406 total prerequisite topics
