Mutual information I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) measures the amount of information that one random variable provides about another. Unlike correlation, which only captures linear relationships, mutual information detects any statistical dependence. It is symmetric: X tells you as much about Y as Y tells you about X. I(X;Y) = 0 if and only if X and Y are independent. It is always non-negative and, for discrete variables, is bounded above by min(H(X), H(Y)). Mutual information is the central quantity in channel capacity, feature selection, and information-theoretic analysis of learning.
You know that conditional entropy H(Y|X) measures the uncertainty remaining in Y after learning X, and that this is always at most H(Y). The gap — the amount by which knowing X reduces uncertainty about Y — is mutual information: I(X;Y) = H(Y) - H(Y|X). It measures how much information X and Y share.
Mutual information has several equivalent expressions, each offering a different perspective. I(X;Y) = H(X) - H(X|Y) shows how much Y tells you about X. I(X;Y) = H(X) + H(Y) - H(X,Y) shows the "redundancy" between X and Y, that is, how much the sum of the individual uncertainties exceeds the joint uncertainty. Finally, I(X;Y) = sum over (x,y) of p(x,y) log(p(x,y) / (p(x)p(y))) is the KL divergence from the product of marginals p(x)p(y) to the joint distribution p(x,y). This last form makes the connection to KL divergence explicit and shows that mutual information measures how far X and Y are from independence.
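The equivalence of these expressions can be checked numerically. The following is a minimal sketch, assuming a small discrete joint distribution given as a nested list; the function names and the two example tables are illustrative choices, not part of any standard library.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information_forms(joint):
    """Compute I(X;Y) in bits three ways from joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]        # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]  # marginal of Y (column sums)

    h_x, h_y = entropy(px), entropy(py)
    h_xy = entropy(p for row in joint for p in row)

    # H(X|Y) = sum over y of p(y) * H(X | Y = y), computed directly from the table.
    h_x_given_y = sum(
        py[j] * entropy(joint[i][j] / py[j] for i in range(len(px)))
        for j in range(len(py)) if py[j] > 0
    )

    via_conditional = h_x - h_x_given_y   # H(X) - H(X|Y)
    via_joint = h_x + h_y - h_xy          # H(X) + H(Y) - H(X,Y)
    via_kl = sum(                         # KL(joint || product of marginals)
        p * math.log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row) if p > 0
    )
    return via_conditional, via_joint, via_kl

# Hypothetical example tables: one dependent pair, one independent pair.
dependent = [[0.4, 0.1], [0.1, 0.4]]
independent = [[0.25, 0.25], [0.25, 0.25]]

print(mutual_information_forms(dependent))
print(mutual_information_forms(independent))
```

For the dependent table all three values agree at roughly 0.278 bits, and for the independent table all three are zero, as the independence property requires.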
The key properties make mutual information exceptionally useful. It is non-negative (I(X;Y) >= 0), symmetric (I(X;Y) = I(Y;X)), and zero if and only if X and Y are independent. Unlike correlation, it captures any form of dependence: because I(X;Y) = 0 only under independence, any statistical relationship between X and Y, linear or not, yields strictly positive mutual information. This generality makes it a standard measure of association in information theory, machine learning (feature selection, information bottleneck), neuroscience (neural coding), and statistics.
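A small example illustrates the contrast with correlation. The sketch below assumes X is uniform on {-1, 0, 1} and Y = X^2, a hypothetical distribution chosen so that the covariance vanishes even though Y is completely determined by X; the helper names are illustrative.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X;Y) in bits from a dict mapping (x, y) to p(x, y)."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items() if p > 0
    )

def covariance(joint):
    """Cov(X, Y) = E[XY] - E[X]E[Y] from the same joint table."""
    e_x = sum(p * x for (x, _), p in joint.items())
    e_y = sum(p * y for (_, y), p in joint.items())
    e_xy = sum(p * x * y for (x, y), p in joint.items())
    return e_xy - e_x * e_y

# X uniform on {-1, 0, 1}, Y = X^2 (assumed example distribution).
joint = {(x, x * x): 1 / 3 for x in (-1, 0, 1)}

cov = covariance(joint)
mi = mutual_information(joint)
print(f"covariance = {cov:.4f}, mutual information = {mi:.4f} bits")
```

Here the covariance is zero, so the correlation coefficient sees no relationship, while the mutual information equals H(Y), about 0.918 bits, because Y is a deterministic function of X.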
In the context of communication, mutual information plays a starring role. Shannon's channel coding theorem states that the capacity of a noisy channel — the maximum rate at which information can be reliably transmitted — equals the maximum mutual information between the input and output: C = max I(X;Y) over all input distributions. This gives mutual information its operational meaning: it is the amount of useful information that survives the noise. The Venn diagram picture (H(X) and H(Y) as overlapping circles, with I(X;Y) as the overlap) provides a powerful visual intuition that extends to understanding conditional mutual information and the data processing inequality.
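The capacity formula can be made concrete for the binary symmetric channel, which flips each transmitted bit independently with some crossover probability. The sketch below assumes a crossover probability of 0.1 and uses a simple grid search over input distributions; the function names and the grid resolution are illustrative choices, and the grid search stands in for the maximization rather than being an efficient method.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin with bias p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mutual_information(q, flip):
    """I(X;Y) in bits for a binary symmetric channel.

    q is P(X = 1); flip is the crossover probability.
    Uses I(X;Y) = H(Y) - H(Y|X), where H(Y|X) = H_b(flip) for every input.
    """
    p_y1 = q * (1 - flip) + (1 - q) * flip  # P(Y = 1)
    return binary_entropy(p_y1) - binary_entropy(flip)

def bsc_capacity_grid(flip, steps=1000):
    """Approximate C = max over q of I(X;Y) by scanning q on a uniform grid."""
    return max(
        ((k / steps, bsc_mutual_information(k / steps, flip))
         for k in range(steps + 1)),
        key=lambda pair: pair[1],
    )

flip = 0.1  # assumed crossover probability
best_q, capacity = bsc_capacity_grid(flip)
print(f"maximizing P(X=1) = {best_q}, capacity = {capacity:.4f} bits per use")
print(f"closed form 1 - H_b(flip) = {1 - binary_entropy(flip):.4f}")
```

The search lands on the uniform input distribution, and the resulting value matches the known closed form C = 1 - H_b(p) for this channel, about 0.531 bits per channel use when p = 0.1.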