Mutual Information

Graduate · Depth 76 in the knowledge graph
Unlocks 28 downstream topics

Core Idea

Mutual information I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) measures the amount of information that one random variable provides about another. Unlike correlation, which only captures linear relationships, mutual information detects any statistical dependence. It is symmetric: X tells you as much about Y as Y tells you about X. It is always non-negative, and I(X;Y) = 0 if and only if X and Y are independent. For discrete variables it is bounded above by min(H(X), H(Y)). Mutual information is the central quantity in channel capacity, feature selection, and information-theoretic analysis of learning.

Explainer

You know that conditional entropy H(Y|X) measures the uncertainty remaining in Y after learning X, and that this is always at most H(Y). The gap — the amount by which knowing X reduces uncertainty about Y — is mutual information: I(X;Y) = H(Y) - H(Y|X). It measures how much information X and Y share.
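As a small worked instance of this gap (an illustrative example, not taken from the original text), let Y be the outcome of a fair six-sided die and let X be its parity. Before learning X, all six faces are equally likely; after learning X, three equally likely faces remain:

```latex
\begin{align*}
H(Y) &= \log_2 6 \approx 2.585 \text{ bits},\\
H(Y \mid X) &= \log_2 3 \approx 1.585 \text{ bits},\\
I(X;Y) &= H(Y) - H(Y \mid X) = \log_2 \tfrac{6}{3} = 1 \text{ bit}.
\end{align*}
```

Learning the parity removes exactly one bit of uncertainty about the roll, which matches H(X) = 1 bit: here X is a function of Y, so everything X carries is shared with Y.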

Mutual information has several equivalent expressions, each offering a different perspective. I(X;Y) = H(X) - H(X|Y) shows how much Y tells you about X. I(X;Y) = H(X) + H(Y) - H(X,Y) shows the "redundancy" between X and Y: how much the sum of the individual uncertainties exceeds the joint uncertainty. Finally, I(X;Y) = sum over (x,y) of p(x,y) log(p(x,y) / (p(x)p(y))) is the KL divergence between the joint distribution and the product of the marginals. This last form makes the connection to KL divergence explicit and shows that mutual information measures how far X and Y are from independence.
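The equivalence of these expressions can be checked numerically. The following is a minimal sketch in plain Python, assuming a small discrete joint distribution given as a nested list; the table values and the function names (`entropy`, `mutual_information_forms`) are illustrative choices, not part of the original text.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities (zeros are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information_forms(joint):
    """Compute I(X;Y) in bits three ways from a joint table with joint[x][y] = p(x, y)."""
    px = [sum(row) for row in joint]            # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]      # marginal of Y (column sums)

    h_x = entropy(px)
    h_y = entropy(py)
    h_xy = entropy(p for row in joint for p in row)

    # H(Y|X) computed directly as the p(x)-weighted entropy of each conditional row.
    h_y_given_x = sum(
        px[i] * entropy(p / px[i] for p in row)
        for i, row in enumerate(joint)
        if px[i] > 0
    )

    via_conditional = h_y - h_y_given_x         # H(Y) - H(Y|X)
    via_joint = h_x + h_y - h_xy                # H(X) + H(Y) - H(X,Y)
    via_kl = sum(                               # KL(joint || product of marginals)
        p * math.log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )
    return via_conditional, via_joint, via_kl


# Hypothetical joint distribution: X and Y agree with probability 0.8.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
forms = mutual_information_forms(joint)
print(forms)  # three values that agree up to floating-point error
```

All three numbers should coincide up to rounding. Replacing the table with an independent one, such as `[[0.25, 0.25], [0.25, 0.25]]`, should drive all three to zero.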

The key properties make mutual information exceptionally useful. It is non-negative (I(X;Y) >= 0), symmetric (I(X;Y) = I(Y;X)), and zero if and only if X and Y are independent. Unlike correlation, it captures any form of dependence: whenever X and Y are not independent, whether the relationship is linear, nonlinear, or non-monotonic, I(X;Y) is strictly positive. This generality makes it a standard measure of association in information theory, machine learning (feature selection, information bottleneck), neuroscience (neural coding), and statistics.
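A classic illustration of the contrast with correlation is a variable and its square. The sketch below assumes X is uniform on {-1, 0, 1} and Y = X^2 (an illustrative choice, not from the original text) and computes both the covariance and the mutual information exactly from the distribution rather than from samples.

```python
import math
from collections import defaultdict

# Joint distribution of (X, Y) with X uniform on {-1, 0, 1} and Y = X**2.
joint = defaultdict(float)
for x in (-1, 0, 1):
    joint[(x, x * x)] += 1 / 3

# Marginal distributions of X and Y.
px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

# Covariance: zero here, so the Pearson correlation is zero as well.
mean_x = sum(x * p for x, p in px.items())
mean_y = sum(y * p for y, p in py.items())
covariance = sum(p * (x - mean_x) * (y - mean_y) for (x, y), p in joint.items())

# Mutual information in bits via the KL-divergence form.
mi_bits = sum(
    p * math.log2(p / (px[x] * py[y]))
    for (x, y), p in joint.items()
    if p > 0
)

print(f"covariance = {covariance:.6f}")
print(f"I(X;Y)     = {mi_bits:.6f} bits")
```

The covariance vanishes because the positive and negative branches cancel, yet Y is completely determined by X, so I(X;Y) equals H(Y), roughly 0.918 bits. Estimating mutual information from finite samples is a separate and harder problem that this exact calculation sidesteps.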

In the context of communication, mutual information plays a starring role. Shannon's channel coding theorem states that the capacity of a noisy (memoryless) channel, the maximum rate at which information can be reliably transmitted, equals the maximum mutual information between the input and output: C = max I(X;Y), with the maximum taken over all input distributions p(x). This gives mutual information its operational meaning: it is the amount of useful information that survives the noise. The Venn diagram picture (H(X) and H(Y) as overlapping circles, with I(X;Y) as the overlap) provides a powerful visual intuition that extends to understanding conditional mutual information and the data processing inequality.
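To make the maximization over input distributions concrete, here is a minimal sketch for a binary symmetric channel, assuming a flip probability of 0.1 (an illustrative value). It searches a grid of input distributions and compares the best mutual information with the known closed form 1 - H_b(eps) for this channel. The grid search is only for illustration; general channels are usually handled with an iterative method such as the Blahut-Arimoto algorithm.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_mutual_information(q, eps):
    """I(X;Y) in bits for a binary symmetric channel with P(X=1) = q and flip probability eps."""
    p_y1 = q * (1 - eps) + (1 - q) * eps        # P(Y = 1)
    # H(Y) - H(Y|X); for this channel H(Y|X) = H_b(eps) regardless of the input.
    return binary_entropy(p_y1) - binary_entropy(eps)


eps = 0.1                                        # assumed flip probability
grid = [i / 1000 for i in range(1001)]           # candidate values of P(X=1)
best_q = max(grid, key=lambda q: bsc_mutual_information(q, eps))
capacity_estimate = bsc_mutual_information(best_q, eps)
closed_form = 1 - binary_entropy(eps)

print(f"best P(X=1)       = {best_q}")
print(f"capacity estimate = {capacity_estimate:.6f} bits per use")
print(f"1 - H_b(eps)      = {closed_form:.6f} bits per use")
```

The maximum is attained at the uniform input, and the estimate should match the closed form of roughly 0.531 bits per channel use.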


Prerequisite Chain

Counting to 10 → Counting to 20 → Understanding Zero → The Number Zero → Counting to Five → One-to-One Correspondence → Combining Small Groups Within 5 → Addition Within 10 → Addition Within 20 → Two-Digit Addition Without Regrouping → Two-Digit Addition with Regrouping → Addition Within 100 → Repeated Addition as Multiplication → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Two-Digit by One-Digit Division → Division with Remainders → Remainders and Quotients in Division → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Reading and Writing Decimals → Comparing and Ordering Decimals → Adding and Subtracting Decimals → Multiplying Decimals → Dividing Decimals → Dividing Fractions → Mixed Number Arithmetic → Order of Operations → Integer Order of Operations → Variable Expressions → Combining Like Terms → One-Step Equations → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Shannon Entropy → Joint and Conditional Entropy → Mutual Information

Longest path: 77 steps · 326 total prerequisite topics
