A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

KL Divergence

Graduate · Depth 97 in the knowledge graph

14 topics build on this

441 prerequisites beneath it

Core Idea

The Kullback-Leibler divergence D_KL(P || Q) = sum over x of p(x) log(p(x)/q(x)) measures how much a probability distribution P differs from a reference distribution Q, in units of information (bits when the logarithm is base 2, nats when it is natural). It quantifies the expected number of extra bits per symbol needed to encode samples from P using a code optimized for Q. KL divergence is always non-negative (Gibbs' inequality), equals zero if and only if P = Q, and is not symmetric: in general D_KL(P||Q) != D_KL(Q||P). It is infinite whenever P assigns positive probability to an outcome that Q rules out. It is the central tool for comparing distributions in information theory, statistics (likelihood ratio tests), and machine learning (variational inference, training generative models).
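The definition translates almost directly into code. The following is a minimal sketch for discrete distributions given as plain lists; the function name `kl_divergence` and the example distributions are illustrative choices, not part of the original text. It uses the usual conventions that terms with p(x) = 0 contribute nothing and that the divergence is infinite when q(x) = 0 but p(x) > 0.

```python
import math

def kl_divergence(p, q, base=2.0):
    """D_KL(P || Q) for discrete distributions given as equal-length sequences.

    Returns the divergence in bits by default (base=2.0).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if qi == 0.0:
            return math.inf  # P puts mass where Q puts none
        total += pi * math.log(pi / qi, base)
    return total

# Illustrative distributions over three outcomes (assumed for this example).
p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

d_pq = kl_divergence(p, q)  # cost of assuming Q when the truth is P
d_qp = kl_divergence(q, p)  # cost of assuming P when the truth is Q
d_pp = kl_divergence(p, p)  # a distribution compared with itself

print(d_pq, d_qp, d_pp)
```

Both directions come out positive but unequal, and comparing `p` with itself gives zero, matching the three properties stated above.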

Explainer

You have seen that mutual information measures how much two random variables share. KL divergence is the more general tool: it measures how one probability distribution differs from another, and mutual information turns out to be a special case. D_KL(P || Q) = sum over x of p(x) log(p(x)/q(x)) answers: if nature generates data from P, but I designed my encoding assuming Q, how many extra bits per symbol do I waste?
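The "wasted bits" reading can be checked with a small worked example. The sketch below assumes idealized code lengths of -log2 q(x) bits for each symbol and a four-symbol source chosen so that every quantity is an exact binary fraction; the distributions and variable names are illustrative assumptions.

```python
import math

# Nature's distribution P and the designer's assumed distribution Q
# (both chosen for illustration).
p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]

# Idealized code lengths in bits: a code tuned to a distribution d
# spends -log2 d(x) bits on symbol x.
lengths_for_p = [-math.log2(pi) for pi in p]  # 1, 2, 3, 3 bits
lengths_for_q = [-math.log2(qi) for qi in q]  # 2 bits for every symbol

# Average bits per symbol when the data really come from P.
best_average = sum(pi * length for pi, length in zip(p, lengths_for_p))
actual_average = sum(pi * length for pi, length in zip(p, lengths_for_q))

# KL divergence computed directly from its definition, in bits.
kl_bits = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(best_average)    # entropy H(P): 1.75 bits per symbol
print(actual_average)  # average length with the mismatched code: 2.0 bits
print(kl_bits)         # the gap between the two: 0.25 bits
```

The mismatched code costs 2 bits per symbol where 1.75 would have sufficed, and the 0.25-bit gap is exactly D_KL(P || Q).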

The asymmetry of KL divergence is not a defect; it reflects a real distinction. D_KL(P || Q) measures the cost of using Q when the truth is P. D_KL(Q || P) measures the cost of using P when the truth is Q. These are different situations. In variational inference, where p is the target and q is the approximation, minimizing D_KL(q || p) (the "reverse" or "exclusive" KL) makes q avoid regions where p is small, producing compact, mode-seeking approximations. Minimizing D_KL(p || q) (the "forward" or "inclusive" KL) makes q cover all regions where p is large, producing diffuse, mean-seeking approximations. The choice of direction fundamentally shapes the behavior of the approximation.
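The effect of the direction can be seen in a small numerical experiment. The sketch below is an illustration under stated assumptions, not a variational inference implementation: the target p is a two-peaked mixture discretized on a grid, the approximation q is a single discretized Gaussian, and a brute-force search over a coarse set of (mean, width) candidates stands in for an optimizer. All names, grid sizes, and parameter values are assumptions made for the example.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def kl(a, b):
    """D_KL(A || B) in nats for two discrete distributions on the same grid."""
    total = 0.0
    for ai, bi in zip(a, b):
        if ai == 0.0:
            continue
        if bi == 0.0:
            return math.inf
        total += ai * math.log(ai / bi)
    return total

# Grid from -12 to 12 in steps of 0.1.
xs = [-12.0 + 0.1 * i for i in range(241)]

# Target p: equal mixture of two narrow peaks at -3 and +3.
p = normalize([0.5 * normal_pdf(x, -3.0, 0.5) + 0.5 * normal_pdf(x, 3.0, 0.5)
               for x in xs])

best_q_p = None  # (divergence, mu, sigma) minimizing D_KL(q || p)
best_p_q = None  # (divergence, mu, sigma) minimizing D_KL(p || q)

for mu_tenths in range(-40, 41, 5):        # means -4.0, -3.5, ..., 4.0
    for sigma_tenths in range(5, 41, 5):   # widths 0.5, 1.0, ..., 4.0
        mu, sigma = mu_tenths / 10.0, sigma_tenths / 10.0
        q = normalize([normal_pdf(x, mu, sigma) for x in xs])
        d_q_p, d_p_q = kl(q, p), kl(p, q)
        if best_q_p is None or d_q_p < best_q_p[0]:
            best_q_p = (d_q_p, mu, sigma)
        if best_p_q is None or d_p_q < best_p_q[0]:
            best_p_q = (d_p_q, mu, sigma)

print("argmin D_KL(q || p): mu=%.1f sigma=%.1f" % best_q_p[1:])
print("argmin D_KL(p || q): mu=%.1f sigma=%.1f" % best_p_q[1:])
```

On this grid, minimizing D_KL(q || p) should select a narrow q sitting on one of the two peaks, while minimizing D_KL(p || q) should select a wide q centered between them: the mode-seeking and mean-seeking behaviors described above.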

Gibbs' inequality states that D_KL(P || Q) >= 0 for all distributions P and Q, with equality if and only if P = Q. This is perhaps the most important inequality in information theory. It implies that the entropy H(P) = -sum p(x) log p(x) is the minimum average code length for distribution P: a code designed for any other distribution Q costs D_KL(P || Q) extra bits per symbol on average. Gibbs' inequality also immediately proves that mutual information is non-negative, since I(X;Y) = D_KL(p(x,y) || p(x)p(y)) >= 0.
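The identity I(X;Y) = D_KL(p(x,y) || p(x)p(y)) can be computed directly from a joint probability table. The sketch below assumes the joint is given as a nested list `joint[x][y]`; the function name and the three example tables are illustrative choices.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits, computed as D_KL(p(x,y) || p(x)p(y)).

    joint[x][y] holds the joint probability of X = x and Y = y.
    """
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:  # zero-probability cells contribute nothing
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# Three illustrative joint distributions over two binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]  # joint equals product of marginals
identical = [[0.5, 0.0], [0.0, 0.5]]        # Y is an exact copy of X
correlated = [[0.4, 0.1], [0.1, 0.4]]       # Y agrees with X 80% of the time

mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
mi_correlated = mutual_information(correlated)

print(mi_independent, mi_identical, mi_correlated)
```

The independent table gives zero because the joint already equals the product of its marginals, the identical table gives one full bit, and the correlated table falls in between. None can be negative, as Gibbs' inequality guarantees.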

KL divergence appears throughout modern machine learning. Cross-entropy loss, the standard training objective for classification, equals H(P) + D_KL(P || Q), where P is the true label distribution and Q is the model's predicted distribution. Minimizing cross-entropy over the model is equivalent to minimizing KL divergence, since H(P) does not depend on the model. The evidence lower bound (ELBO) in variational autoencoders contains a KL term between the approximate posterior and the prior. GANs implicitly minimize a divergence between the real and generated distributions (in the original formulation, the Jensen-Shannon divergence, which is built from two KL terms). Understanding KL divergence (its asymmetry, its non-negativity, its operational meaning as wasted bits) is essential for reasoning about any system that compares probability distributions.
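The decomposition of cross-entropy into H(P) + D_KL(P || Q) is easy to verify numerically. The sketch below uses natural logarithms, as most training code does, and made-up label and prediction vectors; the helper names are illustrative.

```python
import math

def entropy(p):
    """H(P) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def cross_entropy(p, q):
    """H(P, Q) = -sum p(x) log q(x) in nats; assumes q(x) > 0 where p(x) > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q(x) > 0 where p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Soft labels P and model predictions Q over three classes (illustrative values).
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)  # the two agree up to floating-point rounding

# With a one-hot label, H(P) = 0, so the cross-entropy loss *is* the KL
# divergence, and both reduce to -log q(true class).
one_hot = [0.0, 1.0, 0.0]
loss = cross_entropy(one_hot, q)
print(loss, kl_divergence(one_hot, q), -math.log(q[1]))
```

Because `entropy(p)` depends only on the labels, any change to `q` that lowers the cross-entropy lowers the KL divergence by the same amount.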

Practice Questions 4 questions

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Weak Law of Large Numbers → Probability Axioms and Rules → Conditional Probability → Law of Total Probability → Bayes' Theorem → Joint and Conditional Entropy → Mutual Information → KL Divergence

Longest path: 98 steps · 441 total prerequisite topics

Prerequisites (3)

Shannon Entropy (hard) · Mutual Information (hard) · Probability Density Functions (hard)

Leads To (11)

Causal Information Theory (hard) · Fisher Information (soft) · Information Bottleneck Theory (hard) · Information Geometry Advanced (hard) · Information Geometry Basics (hard) · Information Theory and Statistical Inference (hard) · Information-Theoretic Security (hard) · Maximum Entropy Principle (soft) · Minimum Description Length (soft) · Rate-Distortion Theory (hard) · Rate-Distortion Theory Advanced (hard)