Sequence-to-Sequence Models

Research Depth 80 in the knowledge graph
Unlocks 2 downstream topics
nlp sequence-models encoder-decoder

Core Idea

Seq2seq models encode variable-length inputs and decode to variable-length outputs. Attention allows decoders to focus on relevant input parts. Applications include translation, summarization, and question answering. Beam search improves decoding quality.

Explainer

Many important problems involve transforming one sequence into another where the input and output can have different lengths. Translating "How are you?" (three words) into "Comment allez-vous ?" (two or three words depending on tokenization), summarizing a paragraph into a sentence, or converting a spoken utterance into a text transcript: none of these fit the fixed-size-input, fixed-size-output pattern of a standard feed-forward network. Sequence-to-sequence (seq2seq) models solve this by splitting the problem into two halves: an encoder that reads the entire input and compresses it into a fixed-size representation, and a decoder that generates the output one token at a time from that representation.

The encoder, typically an LSTM or GRU network you have already studied, processes the input sequence token by token and produces a final hidden state: a dense vector that in principle captures the meaning of the entire input. The decoder is another recurrent network that takes this hidden state as its initial state and generates output tokens autoregressively. At each step it predicts the next token, feeds that prediction back in as the next input, and continues until it produces a special end-of-sequence token. This architecture handles variable-length inputs and outputs because a recurrent network can process a sequence of any length, while the fixed-size hidden state acts as an information bottleneck bridging the two.
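The encode-then-decode loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a simple tanh RNN cell stands in for the LSTM or GRU, the weights are random rather than trained (so the generated tokens are meaningless), and every name here (`init_params`, `encode`, `greedy_decode`, the token ids for `BOS` and `EOS`) is hypothetical. What it shows is the control flow: the encoder folds an input of any length into one fixed-size vector, and the decoder feeds each prediction back in until it emits the end-of-sequence token or hits a length cap.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def rnn_step(w_in, w_hid, x, h):
    """One step of a plain tanh RNN cell (an assumed stand-in for an LSTM/GRU)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_hid, h))]


def init_params(vocab_size, hidden_size, seed=0):
    rng = random.Random(seed)
    return {
        "enc_in": make_matrix(hidden_size, vocab_size, rng),
        "enc_hid": make_matrix(hidden_size, hidden_size, rng),
        "dec_in": make_matrix(hidden_size, vocab_size, rng),
        "dec_hid": make_matrix(hidden_size, hidden_size, rng),
        "out": make_matrix(vocab_size, hidden_size, rng),
    }


def encode(tokens, params, vocab_size, hidden_size):
    """Read the whole input; return the final hidden state (the bottleneck)."""
    h = [0.0] * hidden_size
    for tok in tokens:
        h = rnn_step(params["enc_in"], params["enc_hid"], one_hot(tok, vocab_size), h)
    return h


def greedy_decode(h, params, vocab_size, bos, eos, max_len):
    """Start from the encoder state and feed each prediction back as input."""
    output, prev = [], bos
    for _ in range(max_len):
        h = rnn_step(params["dec_in"], params["dec_hid"], one_hot(prev, vocab_size), h)
        logits = matvec(params["out"], h)
        prev = max(range(vocab_size), key=lambda i: logits[i])  # greedy choice
        if prev == eos:
            break
        output.append(prev)
    return output


VOCAB, HIDDEN, BOS, EOS = 6, 8, 0, 1
params = init_params(VOCAB, HIDDEN)
state = encode([2, 3, 4], params, VOCAB, HIDDEN)
print(greedy_decode(state, params, VOCAB, BOS, EOS, max_len=10))
```

Note that `encode` returns a vector of the same size whether it reads three tokens or three hundred, which is exactly the bottleneck property the paragraph describes.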

The bottleneck, however, is also the weakness. Compressing an entire input paragraph into a single fixed-size vector inevitably loses information, especially for long sequences. This is where attention mechanisms, which you have studied as a prerequisite, transform the architecture. Instead of relying solely on the final encoder hidden state, attention lets the decoder look back at *all* encoder hidden states at each generation step and compute a weighted combination of them, with the weights recomputed at every step from the decoder's current state. When translating a sentence, the decoder generating the French word for "cat" can attend strongly to the English word "cat" in the input, regardless of how far back it appeared. This soft alignment between input and output positions substantially improves performance on long sequences.
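As a rough sketch of the weighted combination described above, the snippet below scores each encoder hidden state against the current decoder state, normalizes the scores with a softmax, and returns the resulting context vector. The dot-product scoring function is an assumption made for brevity; the text does not say which scoring function is used, and other choices (such as a small learned network) are common. The function names `softmax` and `attend` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    Scoring is a plain dot product, which is an assumed choice.
    """
    scores = [
        sum(d * e for d, e in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    weights = softmax(scores)
    dim = len(decoder_state)
    # Context vector: attention-weighted sum of all encoder hidden states.
    context = [
        sum(w * enc[j] for w, enc in zip(weights, encoder_states))
        for j in range(dim)
    ]
    return weights, context


# Toy example: three encoder states, and a decoder state that most
# resembles the second one.
encoder_states = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
decoder_state = [0.0, 4.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
print(weights)
print(context)
```

In this toy case nearly all of the weight lands on the second encoder state, so the context vector is close to that state. In a full model the context vector would be combined with the decoder state before predicting the next token.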

During generation, the decoder must choose tokens one at a time, but greedily picking the highest-probability token at each step can lead to a suboptimal overall sequence, because a locally best token may force low-probability continuations later. Beam search addresses this by maintaining the k highest-scoring partial sequences (the "beam") at each step, expanding each of them by every possible next token, and keeping only the k candidates with the highest cumulative score (typically the sum of token log-probabilities). With a beam width of 5, for example, the decoder tracks 5 promising hypotheses in parallel and selects the best complete sequence at the end. This is a practical compromise between an intractable exhaustive search over all possible outputs and the myopia of greedy decoding. Seq2seq with attention and beam search was the dominant architecture for machine translation and text generation before transformers, and understanding it is essential groundwork for the attention-only architectures that followed.
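The beam search procedure can be sketched as follows. This is a minimal version under stated assumptions: hypotheses are scored by summed log-probabilities with no length normalization, and the "model" is a hypothetical hand-written table (`TABLE`) in which the next-token distribution depends only on the previous token. The table is constructed so that the greedy first choice leads to a worse complete sequence than the alternative.

```python
import math


def beam_search(next_log_probs, bos, eos, beam_width, max_len):
    """Return (score, tokens) for the best hypothesis found.

    next_log_probs(tokens) must return {token: log_prob} for the next step.
    """
    beam = [(0.0, [bos])]  # (cumulative log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        # Expand every hypothesis in the beam by every possible next token.
        candidates = []
        for score, tokens in beam:
            for tok, log_p in next_log_probs(tokens).items():
                candidates.append((score + log_p, tokens + [tok]))
        # Keep only the beam_width highest-scoring candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, tokens in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((score, tokens))
            else:
                beam.append((score, tokens))
        if not beam:
            break
    finished.extend(beam)  # hypotheses cut off by max_len still compete
    return max(finished, key=lambda c: c[0])


# Hypothetical toy model: probabilities conditioned on the last token only.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "x": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
}


def toy_model(tokens):
    return {tok: math.log(p) for tok, p in TABLE[tokens[-1]].items()}


greedy = beam_search(toy_model, "<s>", "</s>", beam_width=1, max_len=5)
wider = beam_search(toy_model, "<s>", "</s>", beam_width=2, max_len=5)
print(greedy[1], math.exp(greedy[0]))
print(wider[1], math.exp(wider[0]))
```

With a beam width of 1 the search is greedy: it commits to "a" (probability 0.6) and ends with a sequence of total probability 0.33. With a beam width of 2 it keeps "b" alive as well and finds the sequence through "z" with total probability 0.36. Production decoders usually add refinements such as length normalization, which this sketch omits.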

Practice Questions (5)

Prerequisite Chain

Counting to 10 → Counting to 20 → Understanding Zero → The Number Zero → Counting to Five → One-to-One Correspondence → Combining Small Groups Within 5 → Addition Within 10 → Addition Within 20 → Two-Digit Addition Without Regrouping → Two-Digit Addition with Regrouping → Addition Within 100 → Repeated Addition as Multiplication → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Two-Digit by One-Digit Division → Division with Remainders → Remainders and Quotients in Division → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Reading and Writing Decimals → Comparing and Ordering Decimals → Adding and Subtracting Decimals → Multiplying Decimals → Dividing Decimals → Dividing Fractions → Mixed Number Arithmetic → Order of Operations → Integer Order of Operations → Variable Expressions → Combining Like Terms → One-Step Equations → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Literal Equations → Slope-Intercept Form → Point-Slope Form → Writing Linear Equations → Parallel and Perpendicular Line Slopes → Graphing Linear Equations → Piecewise Functions → One-Sided Limits → Continuity Definition → Limit Definition of the Derivative → Power Rule → Constant Multiple and Sum/Difference Rules → Product Rule → Chain Rule → Higher-Order Derivatives → Concavity and Inflection Points → Second Derivative Test → Curve Sketching → Optimization Problems → Critical Points of Multivariable Functions → Critical Points and Classification of Extrema → Second Partial Test for Local Extrema (Hessian) → The Hessian Matrix and Second Derivative Test → Unconstrained Optimization: Finding Extrema → Optimization in Multiple Variables → Introduction to Reinforcement Learning → Policy Gradient Methods → Actor-Critic Methods → Temporal Difference Learning → Q-Learning Algorithm → Deep Q-Networks (DQN) → Recurrent Neural Networks → LSTM and Gated Recurrent Units → Sequence-to-Sequence Models

Longest path: 81 steps · 555 total prerequisite topics

Prerequisites (2)

Leads To (1)