An information-theoretic lower bound proves that estimating a d-dimensional parameter to epsilon accuracy requires at least Omega(d/epsilon^2) samples. A colleague proposes an algorithm that uses only O(sqrt(d)/epsilon^2) samples. What can you conclude?
AThe colleague's algorithm is incorrect — it must have a bug, because no algorithm can beat the lower bound
BThe lower bound may not apply to the colleague's specific problem setting — lower bounds hold for worst-case instances within the stated class, and the colleague's algorithm may exploit additional structure not present in the lower-bound construction
CLower bounds are only proved for Bayesian methods, so the colleague's frequentist algorithm can beat them
DThe lower bound is likely wrong because information-theoretic bounds are known to be loose
Information-theoretic lower bounds are unconditional — they apply to ALL algorithms. If the lower bound is correct for the stated problem class, no algorithm can beat it. Either: (1) the colleague's algorithm does not actually achieve the claimed accuracy (there is an error in the analysis); or (2) the colleague's problem has additional structure (e.g., sparsity, bounded norm, low-rank) that places it outside the class for which the lower bound was proved. Lower bounds are precise about their scope — the problem class, loss function, and distributional assumptions must match exactly. Checking these conditions carefully is the first step in resolving the apparent contradiction.
Question 2 Multiple Choice
Fano's inequality states that the probability of error in identifying a hypothesis from data is at least 1 - (I(X; Y) + 1) / log(M), where M is the number of hypotheses and I(X; Y) is the mutual information between the data and the hypothesis. Why is mutual information the right quantity here?
AMutual information measures computational complexity, which limits the algorithm's ability to process information
BMutual information quantifies the total amount of statistical information the data provides about which hypothesis is true — if the data carries little information about the identity of the hypothesis, no algorithm can reliably identify it
CMutual information is the only quantity that can be computed in closed form for Gaussian distributions
DMutual information measures the noise level in the data, and higher noise means harder learning
Mutual information I(X; Y) measures how much the observation Y reduces uncertainty about the hidden variable X (which hypothesis is the true one). If the data does not carry much information about the hypothesis — because the hypotheses produce similar data distributions — then the mutual information is low, and Fano's inequality guarantees that error probability is high. This captures the fundamental limit: learning is impossible when the data cannot distinguish between alternatives, regardless of how sophisticated the algorithm is. The quantity is information-theoretic, not computational — it applies even with infinite computing power.
Question 3 True / False
Information-theoretic lower bounds apply only to specific algorithms (like ERM or gradient descent), not to all possible learning methods.
TTrue
FFalse
Answer: False
This is the defining feature that distinguishes information-theoretic lower bounds from computational lower bounds. Information-theoretic bounds hold against ALL algorithms — including ones that have not been invented yet and ones with unlimited computational resources. They establish fundamental statistical limits: given n samples from a problem class, no method can achieve error below the lower bound. Computational lower bounds, by contrast, restrict attention to efficient (polynomial-time) algorithms and can be higher than the information-theoretic limit, revealing computational-statistical tradeoffs.
Question 4 Short Answer
Explain Le Cam's method for proving lower bounds and why it reduces the learning problem to a hypothesis testing problem.
Think about your answer, then reveal below.
Model answer: Le Cam's method constructs two specific distributions P_0 and P_1 from the problem class that are hard to distinguish from finite samples. It then argues: if a learner cannot reliably tell whether the data came from P_0 or P_1, it cannot estimate the parameter that distinguishes them. Formally, the minimax estimation error is at least (distance between parameters) * (1 - total variation between P_0^n and P_1^n)/2, where P_i^n is the joint distribution of n samples. If the distributions are close in total variation (hard to tell apart from n samples), the error is large. The reduction to hypothesis testing is natural because learning IS a form of hypothesis testing: the learner must identify which of many possible data-generating processes produced the observed data, and the difficulty of this identification task lower-bounds the learning error.
Le Cam's method is the simplest lower-bound technique but produces tight bounds for many problems. Fano's inequality (the multi-hypothesis generalization) is needed when the parameter space is large and the lower bound must involve many hard-to-distinguish alternatives, not just two.