A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Variational Autoencoders (VAE)

Research Depth 105 in the knowledge graph

783 prerequisites beneath it


Core Idea

Variational autoencoders add probabilistic structure by encoding inputs into latent distributions (usually Gaussian) and decoding samples from these distributions. The ELBO (evidence lower bound) loss combines reconstruction error and KL divergence regularization that encourages the latent distribution to match a standard prior, enabling generative sampling and learning interpretable latent representations.

How It's Best Learned

Implement a VAE on image data, then observe how the latent space enables interpolation between examples and how the KL term affects both representation quality and the quality of generated samples.

Explainer

A standard autoencoder, which you have already studied, compresses an input into a low-dimensional code and reconstructs the original from that code. It learns useful representations, but it has a fundamental limitation as a generative model: the latent space has no structure. If you pick a random point in the latent space, the decoder may produce garbage, because nothing during training forced nearby points to decode into similar or meaningful outputs. Variational autoencoders fix this by imposing probabilistic structure on the latent space, turning the autoencoder from a compression tool into a principled generative model.

The key idea is that instead of encoding an input x into a single latent vector z, the encoder outputs the parameters of a probability distribution, typically the mean μ and variance σ² of a Gaussian with independent dimensions. To get a latent code, you sample z from this distribution: z ~ N(μ, σ²). The decoder then reconstructs x from the sampled z. This means the decoder must handle a range of z values near μ, not just one point, which encourages a smooth latent space: nearby points in latent space decode to similar outputs. The sampling step creates a technical challenge: a draw from N(μ, σ²) is not a differentiable function of μ and σ, so ordinary backpropagation cannot pass gradients through it. The reparameterization trick solves this by rewriting z = μ + σ · ε where ε ~ N(0, 1). Now μ and σ are deterministic outputs of the encoder, ε is an external random input that does not depend on any parameters, and gradients flow through the addition and multiplication back to the encoder.
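The reparameterization step can be sketched in a few lines of plain Python. This is a minimal illustration, not a full encoder: the function names are made up for this example, and it assumes the encoder outputs the log-variance (log σ²) rather than σ directly, a common convention that keeps σ positive.

```python
import math
import random

def sample_eps(dim, rng):
    """Draw the external noise eps ~ N(0, 1) per dimension; the only random step."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def reparameterize(mu, log_var, eps):
    """Return z = mu + sigma * eps, with sigma = exp(0.5 * log_var).

    Given eps, z is a deterministic function of mu and log_var.
    """
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]

rng = random.Random(0)                # seeded only to make the sketch repeatable
mu = [0.5, -1.0]                      # assumed encoder output: means
log_var = [0.0, math.log(0.25)]       # assumed encoder output: sigma = [1.0, 0.5]
eps = sample_eps(len(mu), rng)
z = reparameterize(mu, log_var, eps)  # one latent sample, same length as mu
```

With ε held fixed, the derivative of z with respect to μ is 1 and with respect to σ is ε, which is why an automatic differentiation library can propagate gradients through this expression.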

The VAE training objective comes from the evidence lower bound (ELBO), a lower bound on log p(x). Training maximizes the ELBO, which in practice means minimizing its negative, a loss with two terms. The reconstruction loss measures how well the decoder reproduces the input from the sampled z; this is the same idea as in a standard autoencoder. The KL divergence term measures how far the encoder's distribution N(μ, σ²) deviates from a standard normal prior N(0, 1). You know from your study of KL divergence that it quantifies how much one distribution differs from another (it is asymmetric, so it is not a true distance). By penalizing deviation from the prior, the KL term discourages the encoder from collapsing each input to a narrow spike at a unique point, and it pushes the latent distributions to overlap and organize into a coherent structure. The full loss is: L = reconstruction loss + KL(q(z|x) ‖ p(z)), where q(z|x) = N(μ, σ²) is the encoder's distribution and p(z) = N(0, 1) is the prior. Training minimizes both terms jointly over the encoder and decoder parameters.
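For a Gaussian encoder with independent dimensions and a standard normal prior, the KL term has a closed form: KL = 0.5 · Σ (μ² + σ² − 1 − ln σ²), summed over latent dimensions. The sketch below computes the loss for a single example under two assumptions that are not stated above: the encoder outputs log σ², and the reconstruction loss is a squared error, which corresponds to a Gaussian decoder with fixed variance, up to additive constants. The function names are illustrative.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def squared_error(x, x_hat):
    """Reconstruction term; assumes a Gaussian decoder with fixed variance."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def vae_loss(x, x_hat, mu, log_var):
    """Negative ELBO (up to constants) for one example and one sampled z."""
    return squared_error(x, x_hat) + kl_to_standard_normal(mu, log_var)

# An encoder output that already equals the prior pays no KL penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0

# A narrow spike far from the origin is penalized much more heavily.
print(kl_to_standard_normal([3.0, 3.0], [-4.0, -4.0]))
```

The second call shows the effect described above: a small variance and a mean far from zero both increase the penalty, so the encoder is discouraged from isolating each input at its own point.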

The payoff is a latent space you can actually use for generation. Because the KL term pushes all encoder distributions toward the same standard normal, you can sample z ~ N(0, 1) after training and decode it to generate new data, with no input required. The latent space also supports interpolation: linearly blending the latent codes of two inputs and decoding intermediate points produces smooth transitions between them (for example, one face morphing into another). The tradeoff is that VAE outputs tend to be blurrier than those from GANs. A common explanation is that a pixel-wise reconstruction loss rewards the average of the plausible outputs for a given z, which smooths out fine details. More sophisticated VAEs address this with richer priors, more expressive decoders, or hierarchical latent structures, but the fundamental architecture (encode to a distribution, sample, decode, regularize with KL) remains the foundation.
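Generation and interpolation can be sketched as follows. The decoder here is a hypothetical stand-in (a fixed smooth function of a two-dimensional z), not a trained network, so the outputs are meaningless as data; the point is only the order of operations: sample or blend in latent space, then decode.

```python
import math
import random

def sample_prior(dim, rng):
    """Draw z ~ N(0, 1) per dimension; no input example is needed."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def interpolate(z_a, z_b, steps):
    """Return `steps` evenly spaced points on the line from z_a to z_b (steps >= 2)."""
    points = []
    for i in range(steps):
        t = i / (steps - 1)
        points.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return points

def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: a fixed smooth map of z."""
    return [math.tanh(z[0] + z[1]), math.tanh(z[0] - z[1])]

rng = random.Random(0)

# Generation: decode a latent code drawn from the prior.
new_sample = toy_decoder(sample_prior(2, rng))

# Interpolation: decode points between two latent codes
# (in a real VAE these would be the encoder means of two inputs).
path = interpolate([0.0, 0.0], [1.0, 2.0], steps=3)
frames = [toy_decoder(z) for z in path]
```

In a real VAE, `toy_decoder` would be replaced by the trained decoder network and the two endpoints by the encoder means of two actual inputs.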


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Weak Law of Large Numbers → Probability Axioms and Rules → Conditional Probability → Conditional Distributions → Conditional Expectation → Markov Chains → Markov Decision Processes → Introduction to Reinforcement Learning → Policy Gradient Methods → Policy Networks and Policy Gradients → Actor-Critic Methods → Temporal Difference Learning → Q-Learning Algorithm → Deep Q-Networks (DQN) → Generative Adversarial Networks → Variational Autoencoders (VAE)

Longest path: 106 steps · 783 total prerequisite topics

Prerequisites (6)

Autoencoders for Unsupervised Learning (hard)
Probability Density Functions (hard)
Generative Adversarial Networks (soft)
Discrete Random Variables (soft)
Expected Value (soft)
Probability Density Functions and Continuous Distributions (soft)

Leads To (0)

No topics depend on this one yet.