A machine learning engineer uses mutual information to select features for a classifier. Why might mutual information be preferred over Pearson correlation for feature selection?
A. Mutual information is faster to compute than correlation
B. Mutual information detects any statistical dependence (including nonlinear), while Pearson correlation only measures linear association
C. Mutual information accounts for the causal direction between features and the target
D. Pearson correlation is undefined for discrete variables
Answer: B
Pearson correlation measures only linear association: if Y = X^2 and X is symmetric around zero, the correlation is zero despite perfect functional dependence. Mutual information captures all forms of dependence: linear, nonlinear, categorical, or otherwise. I(X;Y) = 0 if and only if X and Y are independent. This makes it a more general measure for feature selection. The trade-off is that for continuous variables it must be estimated from samples (for example by binning or density estimation), which adds computational cost and estimation error.
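The Y = X^2 case can be checked exactly on a small discrete distribution. The sketch below assumes X is uniform on {-2, -1, 0, 1, 2}; that distribution and the helper names (marginal, pearson, mutual_information) are chosen for illustration and are not taken from any library.

```python
import math

# Assumed toy distribution: X uniform on {-2, -1, 0, 1, 2}, Y = X^2.
xs = [-2, -1, 0, 1, 2]
joint = {(x, x * x): 1 / len(xs) for x in xs}  # P(X=x, Y=y)

def marginal(joint, axis):
    """Sum the joint pmf down to one coordinate (0 for X, 1 for Y)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

def pearson(joint):
    """Pearson correlation computed directly from the joint pmf."""
    ex = sum(p * x for (x, y), p in joint.items())
    ey = sum(p * y for (x, y), p in joint.items())
    cov = sum(p * (x - ex) * (y - ey) for (x, y), p in joint.items())
    vx = sum(p * (x - ex) ** 2 for (x, y), p in joint.items())
    vy = sum(p * (y - ey) ** 2 for (x, y), p in joint.items())
    return cov / math.sqrt(vx * vy)

def mutual_information(joint):
    """I(X;Y) in bits: sum of p(x,y) * log2(p(x,y) / (p(x) p(y)))."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

r = pearson(joint)
mi = mutual_information(joint)
print(f"Pearson r is zero up to rounding: {abs(r) < 1e-9}")
print(f"I(X;Y) = {mi:.3f} bits")
```

Here the correlation vanishes up to floating-point rounding, while the mutual information is about 1.52 bits, which equals H(Y) because Y is a function of X.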
Question 2 Multiple Choice
I(X;Y) = H(X) + H(Y) - H(X,Y). If X and Y are independent, I(X;Y) = 0. If Y is a deterministic function of X, what is I(X;Y)?
A. I(X;Y) = 0 because deterministic relationships contain no randomness
B. I(X;Y) = H(X) + H(Y)
C. I(X;Y) = H(Y), because knowing X completely determines Y, so H(Y|X) = 0
D. I(X;Y) = infinity because the dependence is perfect
Answer: C
If Y = f(X) for some deterministic function f, then H(Y|X) = 0: there is no residual uncertainty about Y once X is known. So I(X;Y) = H(Y) - H(Y|X) = H(Y) - 0 = H(Y). Equivalently, I(X;Y) = H(X) - H(X|Y). If f is invertible, then H(X|Y) = 0 as well, giving I(X;Y) = H(X) = H(Y). If f is not invertible (many-to-one), then H(X|Y) > 0 and I(X;Y) = H(Y) < H(X). This argument assumes discrete X and Y with finite entropy.
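The invertible and many-to-one cases can be compared numerically. This is a minimal sketch assuming X is uniform on four distinct values; the helper entropies_for and the two example functions are hypothetical, picked only to make the entropies come out as whole bits.

```python
import math
from collections import Counter

def entropy(counts, n):
    """Shannon entropy in bits of the distribution counts / n."""
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def entropies_for(xs, f):
    """Return (H(X), H(Y), I(X;Y)) for X uniform on distinct xs and Y = f(X)."""
    xs = list(xs)
    n = len(xs)
    h_x = entropy(Counter(xs).values(), n)
    h_y = entropy(Counter(f(x) for x in xs).values(), n)
    h_xy = entropy(Counter((x, f(x)) for x in xs).values(), n)
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return h_x, h_y, h_x + h_y - h_xy

invertible = entropies_for(range(4), lambda x: x + 1)   # one-to-one
many_to_one = entropies_for(range(4), lambda x: x % 2)  # two inputs per output

print(invertible)   # (2.0, 2.0, 2.0): I = H(X) = H(Y)
print(many_to_one)  # (2.0, 1.0, 1.0): I = H(Y) < H(X)
```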
Question 3 True / False
Mutual information is symmetric: I(X;Y) = I(Y;X). This means that if knowing X reduces your uncertainty about Y by 2 bits, then knowing Y also reduces your uncertainty about X by 2 bits.
T. True
F. False
Answer: True
Symmetry is a fundamental property: I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = I(Y;X). The total information shared between X and Y is the same regardless of which direction you look. This is true even though H(X|Y) != H(Y|X) in general: the two conditional entropies differ by exactly H(X) - H(Y), so the asymmetry cancels. Intuitively, mutual information measures the overlap in information content between the two variables, and overlap is inherently symmetric.
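A quick numerical check of the symmetry, computed both ways on a joint distribution whose conditional entropies differ. The joint pmf below is an arbitrary assumption made for illustration, and the helper names are hypothetical.

```python
import math

# Assumed joint pmf P(X=x, Y=y); X takes 3 values, Y takes 2.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.1, (2, 1): 0.3}

def marginal(joint, axis):
    """Sum the joint pmf down to one coordinate (0 for X, 1 for Y)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as a dict."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def cond_entropy(joint, given_axis):
    """Entropy of the other coordinate given the one at given_axis."""
    pg = marginal(joint, given_axis)
    return -sum(p * math.log2(p / pg[pair[given_axis]])
                for pair, p in joint.items() if p > 0)

h_x_given_y = cond_entropy(joint, given_axis=1)
h_y_given_x = cond_entropy(joint, given_axis=0)
i_from_x = entropy(marginal(joint, 0)) - h_x_given_y  # H(X) - H(X|Y)
i_from_y = entropy(marginal(joint, 1)) - h_y_given_x  # H(Y) - H(Y|X)

print(f"H(X|Y) = {h_x_given_y:.4f}, H(Y|X) = {h_y_given_x:.4f}")
print(f"I via X = {i_from_x:.4f}, I via Y = {i_from_y:.4f}")
```

The two conditional entropies come out clearly different, while the two expressions for the mutual information agree up to rounding.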
Question 4 Short Answer
Explain the Venn diagram interpretation of mutual information and how it relates H(X), H(Y), H(X,Y), H(X|Y), and H(Y|X).
Model answer: Imagine two overlapping circles, one for H(X) and one for H(Y). The union is H(X,Y). The overlap (intersection) is I(X;Y) — the shared information. The left crescent (H(X) minus the overlap) is H(X|Y) — the information in X that Y does not capture. The right crescent is H(Y|X). The chain rule gives H(X,Y) = H(X|Y) + I(X;Y) + H(Y|X) = H(X) + H(Y|X) = H(Y) + H(X|Y). For independent variables the circles don't overlap (I=0, H(X,Y)=H(X)+H(Y)). For perfectly dependent variables the circles coincide (I=H(X)=H(Y), H(X|Y)=H(Y|X)=0).
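The two extreme cases in the model answer can be reproduced with a small helper that returns the left crescent, the overlap, the right crescent, and the union. This is a sketch only: venn_regions is a hypothetical name, and both joint distributions are assumed toy examples on fair bits.

```python
import math

def venn_regions(joint):
    """Return (H(X|Y), I(X;Y), H(Y|X), H(X,Y)) in bits for a joint pmf dict."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    items = [(x, y, p) for (x, y), p in joint.items() if p > 0]
    left = -sum(p * math.log2(p / py[y]) for x, y, p in items)      # H(X|Y)
    right = -sum(p * math.log2(p / px[x]) for x, y, p in items)     # H(Y|X)
    overlap = sum(p * math.log2(p / (px[x] * py[y])) for x, y, p in items)
    union = -sum(p * math.log2(p) for x, y, p in items)             # H(X,Y)
    return left, overlap, right, union

# Two independent fair bits: the circles do not overlap.
independent = venn_regions({(x, y): 0.25 for x in (0, 1) for y in (0, 1)})

# Y is a copy of a fair bit X: the circles coincide.
copied = venn_regions({(0, 0): 0.5, (1, 1): 0.5})

for name, (left, overlap, right, union) in [("independent", independent),
                                            ("copied", copied)]:
    # The three regions should add up to the union H(X,Y).
    print(name, abs(left + overlap + right - union) < 1e-12)
```

Each region is computed from its own definition rather than from the others, so the check that the three regions sum to H(X,Y) is not true by construction.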
This Venn diagram is one of the most useful mental models in information theory. It works exactly for two variables but becomes tricky for three or more, where interaction information (the analog of the intersection of three sets) can be negative. The central region of a three-circle diagram can therefore carry negative "area", which has no counterpart in ordinary set intersections.
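A standard illustration of negative interaction information is the XOR of two independent fair bits. The sketch below assumes the sign convention I(X;Y;Z) = I(X;Y) - I(X;Y|Z), expanded by inclusion-exclusion over entropies; some texts use the opposite sign. The helper H is a hypothetical name.

```python
import math
from itertools import product

# X and Y are independent fair bits; Z = X XOR Y.
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

def H(*axes):
    """Joint entropy in bits of the chosen coordinates (0=X, 1=Y, 2=Z)."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

# Pairwise mutual information: any two of the three variables are independent.
i_xy = H(0) + H(1) - H(0, 1)

# Inclusion-exclusion form of I(X;Y;Z) under the assumed sign convention.
interaction = (H(0) + H(1) + H(2)
               - H(0, 1) - H(0, 2) - H(1, 2)
               + H(0, 1, 2))

print(i_xy)         # 0.0
print(interaction)  # -1.0
```

No pair of variables shares any information, yet any two together determine the third, which is why the three-way term is negative.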