A random variable X takes four values, each with probability 1/4. What is H(X), and why does this value have a natural interpretation in terms of binary encoding?
A. H(X) = 4 bits, because there are 4 possible outcomes
B. H(X) = 2 bits, because log2(4) = 2, meaning two binary questions perfectly identify the outcome
C. H(X) = 1 bit, because each outcome has the same probability
D. H(X) = 0 bits, because there is no uncertainty when all outcomes are equally likely
Answer: B
H(X) = -4*(1/4)*log2(1/4) = -4*(1/4)*(-2) = 2 bits. This means you need exactly 2 binary questions (bits) to identify which of 4 equally likely outcomes occurred. Entropy measures average surprise: each outcome contributes -log2(1/4) = 2 bits of surprise, and averaging over all outcomes gives 2. Entropy equals log2(n) for a uniform distribution over n outcomes — this is the maximum entropy for n outcomes.
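The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not a library routine: the helper name `entropy_bits` is chosen here for illustration, and it assumes the probabilities are passed as a plain list that sums to 1.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; outcomes with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Four equally likely outcomes: each term is (1/4) * log2(1/4) = -0.5.
uniform4 = [0.25] * 4
print(entropy_bits(uniform4))   # 2.0

# For a uniform distribution over n outcomes the result matches log2(n).
n = 8
print(entropy_bits([1 / n] * n), math.log2(n))
```

The two printed values on the last line should agree, which is the log2(n) rule stated in the explanation.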
Question 2 Multiple Choice
A source emits symbol A with probability 0.99 and symbol B with probability 0.01. Which statement about H(X) is correct?
A. H(X) is close to 1 bit because there are two symbols
B. H(X) is close to 0 bits because the outcome is nearly certain: most of the time there is very little surprise
C. H(X) equals exactly 0 bits because one probability dominates
D. H(X) is negative because one probability is very small
Answer: B
H(X) = -0.99*log2(0.99) - 0.01*log2(0.01) ≈ 0.081 bits. When one outcome dominates, there is very little uncertainty on average — you almost always see A, which carries negligible surprise. The rare B carries high surprise (-log2(0.01) ≈ 6.64 bits), but it occurs so infrequently that its contribution to the average is small. Entropy reaches its maximum of 1 bit for two symbols only when both are equally likely (p = 0.5).
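To see how the rare symbol's large surprise is outweighed by its low probability, the calculation can be split into per-symbol surprise and the weighted average. The function names `surprisal_bits` and `binary_entropy_bits` are illustrative choices for this sketch.

```python
import math

def surprisal_bits(p):
    """Surprise of an outcome with probability p, in bits."""
    return -math.log2(p)

def binary_entropy_bits(p):
    """Entropy of a two-symbol source with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return p * surprisal_bits(p) + (1 - p) * surprisal_bits(1 - p)

# Surprise of the common symbol A and the rare symbol B.
print(round(surprisal_bits(0.99), 4), round(surprisal_bits(0.01), 2))

# Entropy grows toward 1 bit as the source approaches p = 0.5.
for p in (0.99, 0.9, 0.5):
    print(p, round(binary_entropy_bits(p), 3))
```

At p = 0.99 the result rounds to 0.081 bits, and at p = 0.5 it is exactly 1 bit, matching the explanation.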
Question 3 True / False
Shannon entropy can be negative for discrete random variables.
T. True
F. False
Answer: False
Shannon entropy for discrete random variables is always non-negative: H(X) >= 0. For any outcome with 0 < p(x) <= 1 we have log(p(x)) <= 0, so the term -p(x)*log(p(x)) is non-negative; outcomes with p(x) = 0 contribute 0 by the convention 0*log(0) = 0. The sum of non-negative terms is non-negative. H(X) = 0 only when the distribution is degenerate (one outcome has probability 1). Note: differential entropy (the continuous analog) CAN be negative, but discrete Shannon entropy cannot.
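The term-by-term argument, and the contrast with differential entropy, can be illustrated numerically. This sketch uses hypothetical helper names. For the continuous case it relies on the standard closed form for a uniform density on an interval of length `width`, whose differential entropy is log2(width) bits.

```python
import math

def entropy_terms_bits(probs):
    """Per-outcome contributions -p*log2(p), with 0*log2(0) taken as 0."""
    return [-p * math.log2(p) if p > 0 else 0.0 for p in probs]

def uniform_differential_entropy_bits(width):
    """Differential entropy of a continuous Uniform(0, width) variable."""
    return math.log2(width)

for probs in ([1.0, 0.0], [0.99, 0.01], [0.25] * 4):
    terms = entropy_terms_bits(probs)
    # Every term is non-negative, so their sum (the entropy) is too.
    print(probs, all(t >= 0 for t in terms), sum(terms))

# A uniform density on an interval shorter than 1 has negative
# differential entropy, unlike any discrete distribution.
print(uniform_differential_entropy_bits(0.5))  # -1.0
```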
Question 4 Short Answer
Explain why entropy is maximized by the uniform distribution over a finite alphabet, and what this reveals about the relationship between entropy and knowledge.
Model answer: The uniform distribution maximizes entropy because it represents maximum ignorance — every outcome is equally plausible, so there is no way to predict the next symbol better than random guessing. Mathematically, this can be proved using Jensen's inequality or Lagrange multipliers: subject to the constraint that probabilities sum to 1, H(X) = -sum p(x) log p(x) is maximized when all p(x) = 1/n, giving H(X) = log(n). Any deviation from uniformity — any structure or predictability — reduces entropy. This reveals that entropy measures what you DON'T know: the more predictable a source is, the lower its entropy, because there is less genuine uncertainty to resolve.
This maximum-entropy property connects to the maximum entropy principle in statistical mechanics and Bayesian inference: when you have no information beyond constraints, the distribution that maximizes entropy is the least presumptuous choice.
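For readers who want the Jensen's inequality route spelled out, here is one way to write it. This is a sketch of the standard argument, with n denoting the alphabet size and the sums restricted to outcomes with p(x) > 0.

```latex
\begin{align*}
H(X) &= \sum_{x:\,p(x)>0} p(x)\,\log\frac{1}{p(x)} \\
     &\le \log\Bigl(\sum_{x:\,p(x)>0} p(x)\cdot\frac{1}{p(x)}\Bigr)
        && \text{(Jensen, since $\log$ is concave)} \\
     &= \log\bigl|\{x : p(x) > 0\}\bigr| \\
     &\le \log n .
\end{align*}
```

The first inequality is tight only when 1/p(x) is the same for every outcome in the support, and the second only when the support is the whole alphabet. Together these force p(x) = 1/n for all x, so the uniform distribution is the unique maximizer.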