The Blahut-Arimoto algorithm iteratively computes the optimal test channel p(x-hat|x) that achieves R(D). Which quantity does it converge to?
AThe algorithm converges to the single-letter mutual information I(X; X-hat)
BThe algorithm converges to a fixed point where the ratio p(x-hat|x) / p(x-hat) satisfies the optimality condition exp(beta * d(x,x-hat)) proportional to p(x-hat|x) / p(x-hat), where beta is the Lagrange multiplier for the distortion constraint
CThe algorithm converges to the channel capacity of the quantization codebook
DThe algorithm converges to the minimum-distance encoding rule
The Blahut-Arimoto algorithm alternates between updating p(x-hat|x) as p(x-hat) * exp(-beta * d(x,x-hat)) / Z(x) and updating p(x-hat) as sum_x p(x) p(x-hat|x). At convergence, the test channel satisfies the optimality KKT condition: the cost of representing x as x-hat (weighted by the Lagrange multiplier beta) is balanced against the probability of x-hat. This fixed point characterizes the rate-distortion tradeoff. The parameter beta increases as D decreases, forcing the algorithm to choose less likely representations to meet tighter distortion constraints.
Question 2 True / False
As the distortion constraint D decreases from D=D_max (where R=0) to D=0 (lossless compression), the optimal test channel transitions from unary (p(x-hat|x) concentrates on a single x-hat) to identity (p(x-hat|x) concentrates on x-hat=x).
TTrue
FFalse
Answer: True
At D=D_max, all symbols can be decoded as the same output x-hat*, so the receiver needs no information (R=0). The test channel assigns all probability to x-hat*. As D decreases, the allowable distortion shrinks, forcing the test channel to become more informative. At D=0, only zero distortion is acceptable, so p(x-hat|x) must be deterministic with x-hat=x. This phase transition from high entropy (low D) to zero entropy (high D) is fundamental to understanding rate-distortion. The transition occurs gradually, but phase-like behavior appears in the derivative dR/dD (the 'bandwidth' of the transition).
Question 3 Short Answer
Explain the connection between rate-distortion theory and the information bottleneck (IB) method in machine learning. How does IB generalize R(D)?
Think about your answer, then reveal below.
Model answer: The information bottleneck method considers three random variables: X (input), Y (output label), and T (compressed representation). IB minimizes I(X;T) - beta*I(T;Y), trading off compression of X (small I(X;T)) against prediction accuracy (large I(T;Y)). Standard rate-distortion considers a single source X and reconstructs X with distortion D. IB extends this: T is the 'bottleneck' representation, I(X;T) is the 'rate' (how much information about X is retained), and I(T;Y) plays the role of a fidelity measure (preserving information about the task Y). When Y is a deterministic function of X, IB reduces to rate-distortion. The Lagrange multiplier beta tunes the tradeoff, with phase transitions occurring at critical beta values (analogous to phase transitions in D in rate-distortion).
IB is rate-distortion theory applied to supervised learning: the encoder compresses X into T such that T still predicts Y well. The information-theoretic principles (monotonicity in beta, phase transitions, Blahut-Arimoto algorithm) directly translate. This connection shows that lossy compression and feature extraction for prediction are fundamentally the same problem viewed through different lenses. Deep learning methods that learn representations (autoencoders, transformers) implicitly perform IB-like compression.
Question 4 Multiple Choice
In remote source coding (source coding with helper), the encoder observes a source X, the helper observes correlated X', and only the encoder can communicate to the decoder. When is it beneficial to have a helper, and what is the rate reduction?
AA helper never reduces rate because the encoder cannot send the helper's observations to the decoder
BA helper reduces rate by min I(X;X') because the helper's side information about X reduces uncertainty
CThe helper reduces rate when X' is highly correlated with X. The rate is R(D | X') = min I(X;X-hat|X'), achievable if the encoder can coordinate coding with the helper (via two-way interaction or shared randomness)
DA helper increases rate due to the overhead of coordinating two sources
In remote source coding, the encoder must encode X while communicating with the helper who observes X'. If the encoder and helper share randomness or can interact, the optimal rate is R(D|X') = min I(X;X-hat|X'), conditioned on the helper's information. The helper effectively reduces the uncertainty the encoder must describe. For instance, if X' is a noisy version of X, the encoder can exploit this correlation to send fewer bits. Wyner's source coding with side information and Slepian-Wolf coding are special cases of this framework.