
A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

LSTM and Gated Recurrent Units

Graduate · Depth 105 in the knowledge graph

4 topics build on this

775 prerequisites beneath it



Core Idea

LSTMs mitigate vanishing gradients with a memory cell whose input, forget, and output gates control information flow. GRUs simplify the design to two gates, reset and update, with no separate cell state. Both capture long-term dependencies better than vanilla RNNs.

How It's Best Learned

Train an LSTM on a language modeling task, compare its convergence against a vanilla RNN, and visualize the gate activation patterns.

Common Misconceptions

LSTMs reduce gradient problems but do not eliminate them; initialization and learning rate still matter. More gates do not always improve performance; GRUs often match LSTM results.
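As a rough illustration of why initialization matters, the sketch below estimates how much of a cell-state value survives a long sequence for different forget-gate biases. It makes a strong simplifying assumption that is not part of the text above: the forget gate is held constant at sigmoid(bias), ignoring the input and hidden state. The function name `retained_fraction` is hypothetical.

```python
import math

def retained_fraction(forget_bias, steps):
    """Fraction of a cell-state value left after `steps` time steps,
    assuming every forget gate equals sigmoid(forget_bias) (a simplification)."""
    f = 1.0 / (1.0 + math.exp(-forget_bias))
    return f ** steps

for bias in (0.0, 1.0, 5.0):
    kept = retained_fraction(bias, 100)
    print(f"forget bias {bias}: {kept:.3g} retained after 100 steps")
```

Under this assumption a zero bias halves the stored value at every step, while a larger positive bias keeps most of it, which is one reason the starting value of the forget-gate bias can change how well an LSTM trains.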

Explainer

Recall from recurrent neural networks that a vanilla RNN processes a sequence by passing a hidden state from one time step to the next, applying the same weight matrix at each step. The problem is that during backpropagation through time, the gradient is multiplied at every step by that same recurrent matrix (together with the derivative of the activation function). If the largest eigenvalue magnitude of the matrix is below one, the gradient shrinks exponentially toward zero; if it is above one, the gradient can instead grow exponentially. In the vanishing case, the gradient signal from early inputs can become negligible after as few as 10–20 time steps, which makes it very hard for the network to learn long-range dependencies, such as the relationship between a subject at the start of a paragraph and a verb at the end.
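The exponential shrinkage can be checked numerically. The sketch below is a simplified model, not a full RNN: it ignores the activation derivative and uses a recurrent matrix of the form scale * Q with Q orthogonal, so that every eigenvalue has magnitude exactly `scale`. The function name `gradient_norms` and the sizes are illustrative assumptions.

```python
import numpy as np

def gradient_norms(scale, steps=20, size=8, seed=0):
    """Norm of a backpropagated vector after each step through a linear RNN
    whose recurrent matrix is scale * Q, with Q orthogonal."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    w = scale * q
    grad = np.ones(size) / np.sqrt(size)  # unit-norm gradient at the last time step
    norms = [np.linalg.norm(grad)]
    for _ in range(steps):
        grad = w.T @ grad  # one step of backpropagation through time
        norms.append(np.linalg.norm(grad))
    return norms

for scale in (0.5, 0.9, 1.0):
    print(f"scale {scale}: norm after 20 steps = {gradient_norms(scale)[-1]:.3g}")
```

In this setup the norm after t steps is scale ** t, so a scale of 0.5 leaves less than a millionth of the signal after 20 steps. A saturating activation such as tanh has a derivative of at most one, so in a real RNN it can only shrink the gradient further.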

The Long Short-Term Memory (LSTM) cell addresses this by introducing a separate cell state: a highway that runs through the entire sequence with only linear interactions. Information on this highway can flow nearly unchanged across many time steps because it is not repeatedly squashed through a nonlinear activation. Three gates control what enters and exits the cell state. The forget gate looks at the current input and previous hidden state, then outputs a value between 0 and 1 for each dimension of the cell state, where 1 means "keep this entirely" and 0 means "erase it." The input gate decides how much of a newly computed candidate value to write into the cell state, and the output gate decides which parts of the cell state to expose as the hidden state for the current time step. Each gate is itself a small learned layer with a sigmoid activation, so the LSTM learns when to remember and when to forget.
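The gate description above can be sketched as a single time step in NumPy. This is a minimal illustration of the standard LSTM update, not any library's implementation: the function name `lstm_step`, the stacked weight layout, and the ordering of the gate blocks are assumptions chosen for readability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step (illustrative layout, not a library API).

    x: (input_size,); h_prev and c_prev: (hidden,)
    W: (4 * hidden, input_size + hidden); b: (4 * hidden,)
    Row blocks of W and b are ordered forget, input, output, candidate.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[:hidden])                # forget gate: 1 keeps, 0 erases
    i = sigmoid(z[hidden:2 * hidden])      # input gate: how much to write
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate: how much to expose
    g = np.tanh(z[3 * hidden:])            # candidate values to write
    c = f * c_prev + i * g                 # cell state: gated, elementwise update
    h = o * np.tanh(c)                     # hidden state for this time step
    return h, c

# Usage: run a short random sequence through the cell.
rng = np.random.default_rng(0)
input_size, hidden = 3, 4
W = rng.standard_normal((4 * hidden, input_size + hidden)) * 0.1
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.standard_normal((5, input_size)):
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)
```

Note that the cell state is only scaled by `f` and shifted by `i * g`; it never passes through a weight matrix or a squashing function on its way to the next step, which is the "highway" property described above.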

The Gated Recurrent Unit (GRU) simplifies this architecture by merging the cell state and hidden state into a single vector and using only two gates: a reset gate that controls how much of the previous hidden state to ignore when computing the candidate update, and an update gate that interpolates between the old hidden state and the candidate. The update gate plays the combined role of the LSTM's forget and input gates. Despite having fewer parameters, GRUs often perform comparably to LSTMs, and each training step is typically cheaper because there is less computation per time step.
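A GRU step can be sketched the same way. This is a minimal illustration under stated assumptions: the function name `gru_step` and the separate weight matrices are hypothetical, and the interpolation uses the convention that an update gate of 1 means "take the candidate"; some papers and libraries use the opposite convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU time step (illustrative, not a library API).

    x: (input_size,); h_prev: (hidden,)
    Each W_*: (hidden, input_size + hidden); each b_*: (hidden,)
    """
    xh = np.concatenate([x, h_prev])
    z = sigmoid(W_z @ xh + b_z)  # update gate: 0 keeps old state, 1 takes candidate
    r = sigmoid(W_r @ xh + b_r)  # reset gate: how much old state feeds the candidate
    h_cand = np.tanh(W_h @ np.concatenate([x, r * h_prev]) + b_h)
    return (1.0 - z) * h_prev + z * h_cand  # interpolate old state and candidate

# Usage: run a short random sequence through the unit.
rng = np.random.default_rng(0)
input_size, hidden = 3, 2
shape = (hidden, input_size + hidden)
W_z, W_r, W_h = (rng.standard_normal(shape) * 0.1 for _ in range(3))
b_z, b_r, b_h = np.zeros(hidden), np.zeros(hidden), np.zeros(hidden)
h = np.zeros(hidden)
for x in rng.standard_normal((5, input_size)):
    h = gru_step(x, h, W_z, W_r, W_h, b_z, b_r, b_h)
print(h.shape)
```

Because the single gate `z` sets both how much old state is kept and how much new content is written, the two quantities always sum to one, whereas an LSTM can set its forget and input gates independently.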

In practice, the choice between LSTM and GRU is empirical. LSTMs tend to have a slight edge on tasks requiring very precise memory control, such as copying sequences or counting nested brackets, because the separate cell state gives them more capacity to hold information without interference. GRUs work well on shorter sequences or when training speed matters. Both architectures share the core insight: instead of forcing all information through a single repeatedly-multiplied hidden state, use learned gates to create controlled pathways for information to persist across time steps. This gating mechanism is what makes it practical to model sequences far longer than a vanilla RNN can handle, often hundreds of time steps.
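The cost side of this trade-off can be made concrete by counting parameters. The sketch below assumes a single layer with one weight matrix and one bias vector per gate or candidate block; the function name `gated_rnn_params` and the sizes are illustrative, and some libraries keep two bias vectors per block, which changes the totals slightly.

```python
def gated_rnn_params(input_size, hidden_size, num_blocks):
    """Parameter count for one recurrent layer, assuming each block has a
    (hidden, input + hidden) weight matrix and one bias vector."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block

# An LSTM has 4 blocks (forget, input, output, candidate);
# a GRU has 3 (update, reset, candidate).
lstm_params = gated_rnn_params(128, 256, num_blocks=4)  # 394240
gru_params = gated_rnn_params(128, 256, num_blocks=3)   # 295680
print(lstm_params, gru_params, gru_params / lstm_params)
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same width, which accounts for much of its lower cost per time step.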


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Weak Law of Large Numbers → Probability Axioms and Rules → Conditional Probability → Conditional Distributions → Conditional Expectation → Markov Chains → Markov Decision Processes → Introduction to Reinforcement Learning → Policy Gradient Methods → Policy Networks and Policy Gradients → Actor-Critic Methods → Temporal Difference Learning → Q-Learning Algorithm → Deep Q-Networks (DQN) → Recurrent Neural Networks → LSTM and Gated Recurrent Units

Longest path: 106 steps · 775 total prerequisite topics

Prerequisites (3)

Recurrent Neural Networks (hard) · Partial Derivatives: Definition and Computation (soft) · Matrix Operations (soft)

Leads To (2)

Gated Recurrent Units (GRU) (hard) · Sequence-to-Sequence Models (hard)