English text has an alphabet of 27 characters (26 letters + space). If characters were i.i.d. uniform, H = log2(27) ≈ 4.76 bits/character. Shannon estimated the entropy rate of English at about 1.0-1.5 bits/character. Why the huge gap?
AEnglish uses fewer than 27 characters in practice
BEnglish has massive redundancy: character frequencies are highly non-uniform, and sequential dependencies (digrams, trigrams, word structure, grammar, semantics) reduce the conditional entropy far below the marginal entropy
CShannon's estimate was inaccurate
DThe i.i.d. model is correct for text; the gap is due to measurement error
The marginal entropy of English characters (accounting for frequency alone) is about 4.1 bits. But H(X_n | X_{n-1}) is lower because 'qu' is much more likely than 'qz'. H(X_n | X_{n-1}, X_{n-2}) is even lower. As you condition on more context — word boundaries, grammar, topic, world knowledge — the conditional entropy drops to about 1.0-1.5 bits/character. This means English text is about 75% redundant, which is why text compression works so well: gzip achieves about 2 bits/character, and GPT-level language models approach Shannon's estimate.
Question 2 True / False
For a stationary process, both limits lim (1/n)H(X_1,...,X_n) and lim H(X_n | X_{n-1},...,X_1) exist and are equal.
TTrue
FFalse
Answer: True
For a stationary process, H(X_n | X_{n-1},...,X_1) is non-increasing in n (conditioning on more cannot increase entropy) and bounded below by 0, so it converges. Cesaro's theorem then guarantees that the per-symbol entropy (1/n)H(X_1,...,X_n) converges to the same limit. The first is the per-symbol entropy of blocks; the second is the instantaneous conditional entropy. Their equality means the compression limit per symbol is the same whether you measure it by block coding efficiency or by predictability of the next symbol.
Question 3 Short Answer
Compute the entropy rate of a binary Markov chain where P(0|0) = 0.9, P(1|0) = 0.1, P(0|1) = 0.5, P(1|1) = 0.5, and explain what the result tells you about compressing sequences from this chain.
Think about your answer, then reveal below.
Model answer: The stationary distribution satisfies pi_0 * 0.1 = pi_1 * 0.5, giving pi_0 = 5/6, pi_1 = 1/6. The entropy rate is H' = pi_0 * H(0.1) + pi_1 * H(0.5) = (5/6)(0.469) + (1/6)(1.0) = 0.391 + 0.167 = 0.558 bits/symbol. The marginal entropy H(X) = H(pi_0) = H(5/6) ≈ 0.650 bits. The entropy rate (0.558) is lower than the marginal entropy (0.650) because the Markov dependencies provide predictive information — knowing the current state reduces uncertainty about the next. A compressor using a first-order Markov model can achieve 0.558 bits/symbol, while an i.i.d. compressor would need 0.650 bits/symbol.
The gap H(X) - H' = 0.092 bits/symbol represents the predictive information in the sequential dependencies. State 0 (pi=5/6, low entropy transitions) contributes most of the output and is very predictable. State 1 (pi=1/6, maximum entropy transitions) is unpredictable but rare. The entropy rate captures this mixture.