KL Divergence

Graduate · Depth 77 in the knowledge graph
Unlocks 14 downstream topics
KL divergence · Kullback-Leibler · relative entropy · divergence

Core Idea

The Kullback-Leibler divergence D_KL(P || Q) = sum over x of p(x) log(p(x)/q(x)) measures how much a probability distribution P differs from a reference distribution Q, in units of information (bits when the logarithm is base 2, nats when it is natural). It quantifies the expected number of extra bits needed to encode samples from P using a code optimized for Q. KL divergence is always non-negative (Gibbs' inequality), equals zero only when P = Q, and is not symmetric: in general D_KL(P || Q) != D_KL(Q || P). It is finite only if q(x) > 0 wherever p(x) > 0. It is a central tool for comparing distributions in information theory, statistics (likelihood ratio tests), and machine learning (variational inference, training generative models).

Explainer

You have seen that mutual information measures how much information two random variables share. KL divergence is the more general tool: it measures how one probability distribution differs from another, and mutual information turns out to be a special case. D_KL(P || Q) = sum over x of p(x) log(p(x)/q(x)) answers: if nature generates data from P, but I designed my encoding assuming Q, how many extra bits per symbol do I waste on average?
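The "wasted bits" reading can be checked on a small case. The sketch below assumes discrete distributions given as arrays over the same outcomes; the function name `kl_divergence_bits` and the four-symbol source are illustrative choices, not part of the original text.

```python
import numpy as np

def kl_divergence_bits(p, q):
    """D_KL(P || Q) in bits for two discrete distributions on the same outcomes."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0                      # terms with p(x) = 0 contribute 0 by convention
    if np.any(q[support] == 0):
        return float("inf")              # P puts mass where Q puts none
    return float(np.sum(p[support] * np.log2(p[support] / q[support])))

# Hypothetical source P, encoded with a code designed for a uniform Q.
p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]

entropy_p = -sum(pi * np.log2(pi) for pi in p)   # best achievable: 1.75 bits/symbol
extra_bits = kl_divergence_bits(p, q)            # overhead of coding for Q: 0.25 bits/symbol
print(f"H(P) = {entropy_p:.2f} bits, extra = {extra_bits:.2f} bits")
```

A code built for the uniform Q spends 2 bits on every symbol, while the best code for P averages 1.75 bits; the 0.25-bit gap is exactly D_KL(P || Q).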

The asymmetry of KL divergence is not a defect; it reflects a real distinction. D_KL(P || Q) measures the cost of using Q when the truth is P. D_KL(Q || P) measures the cost of using P when the truth is Q. These are different situations. In variational inference, where q approximates a target p, minimizing D_KL(q || p) (the "reverse" or "exclusive" KL) makes q avoid regions where p is small, producing compact, mode-seeking approximations. Minimizing D_KL(p || q) (the "forward" or "inclusive" KL) makes q cover all regions where p is large, producing diffuse, mean-seeking approximations. The choice of direction fundamentally shapes the behavior of the approximation.
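The effect of the direction can be seen by fitting a single Gaussian q to a two-mode target p under each objective. This is a rough numerical sketch: the target mixture, the evaluation grid, and the brute-force parameter search are all assumptions made for illustration, and the integrals are approximated by Riemann sums.

```python
import numpy as np

# Grid on which all densities are evaluated (Riemann-sum approximation).
x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def kl_on_grid(a, b):
    """Approximate D_KL(a || b) in nats for strictly positive densities on the grid."""
    return float(np.sum(a * (np.log(a) - np.log(b))) * dx)

# Assumed target p: an equal mixture of two well-separated Gaussians.
p = 0.5 * normal_pdf(x, -3.0, 0.7) + 0.5 * normal_pdf(x, 3.0, 0.7)

# Candidate approximations q: single Gaussians over a coarse parameter grid.
mus = np.linspace(-4.0, 4.0, 81)
sigmas = np.linspace(0.5, 4.0, 36)

def best_fit(objective):
    """Return the (mu, sigma) pair minimizing objective(q) by brute-force search."""
    scores = [(objective(normal_pdf(x, m, s)), m, s) for m in mus for s in sigmas]
    _, m, s = min(scores)
    return float(m), float(s)

mu_excl, sigma_excl = best_fit(lambda q: kl_on_grid(q, p))  # minimize D_KL(q || p)
mu_incl, sigma_incl = best_fit(lambda q: kl_on_grid(p, q))  # minimize D_KL(p || q)

print(f"min D_KL(q || p): mu = {mu_excl:+.1f}, sigma = {sigma_excl:.1f}")
print(f"min D_KL(p || q): mu = {mu_incl:+.1f}, sigma = {sigma_incl:.1f}")
```

Under these assumptions the D_KL(q || p) fit should lock onto one of the two modes with a width close to that mode's, while the D_KL(p || q) fit should sit between the modes with a much larger width, matching the mean and variance of the mixture as closely as the grid allows.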

Gibbs' inequality states that D_KL(P || Q) >= 0 for all distributions P and Q, with equality if and only if P = Q. This is perhaps the most important inequality in information theory. It implies that the entropy H(P) = -sum over x of p(x) log p(x) is the minimum average code length for distribution P: a code designed for any other distribution Q spends H(P) + D_KL(P || Q) bits per symbol on average, that is, D_KL(P || Q) extra bits. Gibbs' inequality also immediately proves that mutual information is non-negative, since I(X;Y) = D_KL(p(x,y) || p(x)p(y)) >= 0.
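Both claims can be checked numerically. The sketch below spot-checks non-negativity on randomly drawn distributions and computes mutual information as the KL divergence between a joint distribution and the product of its marginals. The joint table and the helper name `kl_bits` are hypothetical, and a handful of random draws is a sanity check, not a proof.

```python
import numpy as np

def kl_bits(p, q):
    """D_KL(P || Q) in bits; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    support = p > 0
    return float(np.sum(p[support] * np.log2(p[support] / q[support])))

# Spot-check Gibbs' inequality on random pairs of distributions over 5 outcomes.
rng = np.random.default_rng(0)
random_kls = []
for _ in range(1000):
    p, q = rng.dirichlet(np.ones(5), size=2)
    random_kls.append(kl_bits(p, q))
print("smallest KL over random pairs:", min(random_kls))

# Hypothetical joint distribution of two correlated binary variables X and Y.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
p_x = joint.sum(axis=1)            # marginal of X
p_y = joint.sum(axis=0)            # marginal of Y
product = np.outer(p_x, p_y)       # what the joint would be under independence

mutual_information = kl_bits(joint, product)
print(f"I(X;Y) = {mutual_information:.4f} bits")

# For an independent pair the joint equals the product, so the divergence vanishes.
independent = np.outer([0.3, 0.7], [0.6, 0.4])
mi_independent = kl_bits(
    independent, np.outer(independent.sum(axis=1), independent.sum(axis=0))
)
```

Here I(X;Y) is the number of bits per pair wasted by a code that treats X and Y as independent when they are not.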

KL divergence appears throughout modern machine learning. Cross-entropy loss, the standard training objective for classification, equals H(P) + D_KL(P || Q), where P is the true label distribution and Q is the model's predicted distribution. Minimizing cross-entropy over the model is therefore equivalent to minimizing KL divergence, since H(P) does not depend on Q. The evidence lower bound (ELBO) in variational autoencoders involves a KL term. GANs minimize divergences between real and generated distributions. Understanding KL divergence, including its asymmetry, its non-negativity, and its operational meaning as wasted bits, is essential for reasoning about any system that compares probability distributions.
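The decomposition of cross-entropy can be verified directly. In this sketch the label and prediction vectors are made-up numbers and the helper names are illustrative; natural logarithms are used, so quantities are in nats, as is common for training losses.

```python
import numpy as np

def entropy(p):
    """H(P) in nats."""
    p = np.asarray(p, dtype=float)
    support = p > 0
    return float(-np.sum(p[support] * np.log(p[support])))

def cross_entropy(p, q):
    """H(P, Q) = -sum_x p(x) log q(x), in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    return float(-np.sum(p[support] * np.log(q[support])))

def kl(p, q):
    """D_KL(P || Q) in nats; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

# Hypothetical soft label distribution P and model prediction Q over three classes.
p_soft = [0.7, 0.2, 0.1]
q_model = [0.5, 0.3, 0.2]
lhs = cross_entropy(p_soft, q_model)
rhs = entropy(p_soft) + kl(p_soft, q_model)
print(f"cross-entropy = {lhs:.6f}, H(P) + KL = {rhs:.6f}")

# With a one-hot label H(P) = 0, so cross-entropy and KL coincide and both
# reduce to the negative log-probability assigned to the true class.
p_onehot = [0.0, 1.0, 0.0]
ce_onehot = cross_entropy(p_onehot, q_model)
kl_onehot = kl(p_onehot, q_model)
```

Because H(P) is fixed by the labels, any change to the model that lowers the cross-entropy lowers D_KL(P || Q) by the same amount.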

Prerequisite Chain

Counting to 10 → Counting to 20 → Understanding Zero → The Number Zero → Counting to Five → One-to-One Correspondence → Combining Small Groups Within 5 → Addition Within 10 → Addition Within 20 → Two-Digit Addition Without Regrouping → Two-Digit Addition with Regrouping → Addition Within 100 → Repeated Addition as Multiplication → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Two-Digit by One-Digit Division → Division with Remainders → Remainders and Quotients in Division → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Reading and Writing Decimals → Comparing and Ordering Decimals → Adding and Subtracting Decimals → Multiplying Decimals → Dividing Decimals → Dividing Fractions → Mixed Number Arithmetic → Order of Operations → Integer Order of Operations → Variable Expressions → Combining Like Terms → One-Step Equations → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Shannon Entropy → Joint and Conditional Entropy → Mutual Information → KL Divergence

Longest path: 78 steps · 327 total prerequisite topics
