You know only that a die has mean 4.5 (higher than the fair-die mean of 3.5). The maximum entropy distribution subject to this constraint will be:
A. Uniform over {1,2,3,4,5,6}: MaxEnt always gives uniform distributions
B. An exponential-family distribution that tilts probability toward higher faces, with the tilt parameter determined by the mean constraint; it assigns more probability to 5 and 6 than to 1 and 2
C. A point mass on 4.5
D. A uniform distribution over {4, 5, 6} only
Answer: B
With only a mean constraint E[X] = 4.5 (plus normalization), the MaxEnt distribution is p(k) proportional to exp(lambda * k) for k = 1,...,6, where lambda > 0 is chosen so that E[X] = 4.5. This is an exponentially tilted version of the uniform distribution, a member of a discrete exponential family. It is NOT uniform (the uniform has mean 3.5, violating the constraint). It assigns positive probability to all faces but more to higher ones. MaxEnt gives the uniform only when normalization is the sole constraint, or when any additional constraints are already satisfied by the uniform (for example, a mean constraint of exactly 3.5, which gives lambda = 0).
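The tilt parameter has no closed form here, so it helps to see it found numerically. The following is a minimal sketch, assuming a simple bisection search; the function names (tilted_distribution, solve_lambda) and the search bracket [-10, 10] are illustrative choices, not part of the original question.

```python
import math

FACES = range(1, 7)

def tilted_distribution(lam):
    """Return p(k) proportional to exp(lam * k) for k = 1..6."""
    weights = [math.exp(lam * k) for k in FACES]
    total = sum(weights)
    return [w / total for w in weights]

def mean(p):
    """Expected face value under the distribution p."""
    return sum(k * pk for k, pk in zip(FACES, p))

def solve_lambda(target_mean, lo=-10.0, hi=10.0, tol=1e-12):
    """Bisection on lam; the mean is strictly increasing in lam.

    The bracket [lo, hi] is an assumption that comfortably covers
    any target mean strictly between 1 and 6.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(tilted_distribution(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

lam = solve_lambda(4.5)
p = tilted_distribution(lam)

print(f"lambda = {lam:.4f}, mean = {mean(p):.4f}")
for k, pk in zip(FACES, p):
    print(f"p({k}) = {pk:.4f}")
```

Running this should show a positive lambda and probabilities that increase from face 1 to face 6, with every face keeping nonzero probability. Calling solve_lambda(3.5) instead should return a value close to 0, recovering the uniform distribution.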
Question 2 True / False
The maximum entropy distribution for a continuous random variable with known mean mu and variance sigma^2 is the Gaussian N(mu, sigma^2).
T. True
F. False
Answer: True
Among all continuous distributions on the real line with mean mu and variance sigma^2, the Gaussian maximizes differential entropy: h(X) = (1/2) log(2*pi*e*sigma^2). This is proved using Lagrange multipliers: the constraints fix the first two moments, and the resulting MaxEnt distribution is the Gaussian (an exponential-family distribution with natural parameters determined by the mean and variance constraints). This is why the Gaussian appears so frequently in information theory: it represents maximum ignorance subject to power (variance) constraints.
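As a quick sanity check of this claim, one can compare the closed-form differential entropies of a few familiar distributions that share the same variance. The sketch below uses standard formulas (in nats) for the Gaussian, uniform, and Laplace distributions; the choice of comparison distributions and the function names are illustrative assumptions, and comparing three families is a spot check, not a proof.

```python
import math

def gaussian_entropy(sigma):
    """h(X) = (1/2) ln(2*pi*e*sigma^2) for N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

def uniform_entropy(sigma):
    """Uniform on an interval of width w has variance w^2 / 12 and entropy ln(w)."""
    width = sigma * math.sqrt(12)
    return math.log(width)

def laplace_entropy(sigma):
    """Laplace with scale b has variance 2*b^2 and entropy 1 + ln(2*b)."""
    b = sigma / math.sqrt(2)
    return 1 + math.log(2 * b)

sigma = 1.0
entropies = {
    "gaussian": gaussian_entropy(sigma),
    "laplace": laplace_entropy(sigma),
    "uniform": uniform_entropy(sigma),
}

# Sort from highest to lowest entropy at the shared variance.
for name, h in sorted(entropies.items(), key=lambda item: -item[1]):
    print(f"{name:9s} h = {h:.4f} nats")
```

Because all three entropies shift by the same ln(sigma) when sigma changes, the ordering does not depend on the variance chosen.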
Question 3 Short Answer
Explain why the MaxEnt distribution minimizes KL divergence from the uniform distribution (or the specified prior), and what this reveals about the principle's relationship to Bayesian inference.
Model answer: For a finite set X, maximizing entropy H(p) = -sum p(x) log p(x) subject to constraints is equivalent to minimizing D_KL(p || u), where u is the uniform distribution on X, because H(p) = log|X| - D_KL(p || u) and log|X| is constant. So MaxEnt finds the distribution closest to uniform (most ignorant) that satisfies the constraints. More generally, if there is a prior distribution q, the 'minimum relative entropy' principle minimizes D_KL(p || q) subject to constraints; this reduces to MaxEnt when q is uniform. This connects to Bayesian inference: minimizing relative entropy acts as an updating rule that moves a prior q to the closest distribution satisfying the new constraints, and starting from an uninformative (uniform) prior it yields the MaxEnt distribution. Jaynes argued this gives MaxEnt an objective Bayesian justification.
The equivalence MaxEnt <=> min D_KL(p || prior) links information theory and Bayesian statistics. It also helps explain why exponential family distributions appear in both: they arise as MaxEnt solutions under moment constraints, and they are the families that admit conjugate priors in Bayesian analysis. The same mathematical structure underlies both frameworks.
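The identity H(p) = log|X| - D_KL(p || u) used in the model answer can be checked numerically on a small example. This is a minimal sketch; the distribution p below is an arbitrary illustrative choice, and the helper names entropy and kl_divergence are assumptions introduced here.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum p(x) ln p(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum p(x) ln(p(x) / q(x)), assuming q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# An arbitrary distribution on a 4-element set (illustrative values only).
p = [0.1, 0.2, 0.3, 0.4]
u = [1 / len(p)] * len(p)  # uniform distribution on the same set

lhs = entropy(p)
rhs = math.log(len(p)) - kl_divergence(p, u)

print(f"H(p)                 = {lhs:.6f}")
print(f"log|X| - D_KL(p||u)  = {rhs:.6f}")
```

The two printed values should agree up to floating-point error. Since log|X| is fixed, any p that raises H(p) lowers D_KL(p || u) by the same amount.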