The entropy rate H' = lim_{n->inf} H(X_n | X_{n-1}, ..., X_1) measures the average information per symbol in a stochastic process, accounting for all dependencies between symbols. For i.i.d. processes, H' = H(X). For stationary Markov chains, H' = sum_i pi_i * H(X_n | X_{n-1} = i), where pi is the stationary distribution — the entropy rate depends only on the transition probabilities weighted by the stationary distribution. The entropy rate is the true compression limit for the process: any lossless compressor must use at least H' bits per symbol, and this can be approached by compressors that model the dependencies.
Shannon entropy H(X) measures uncertainty per symbol when symbols are independent. Real data sources — language, video, financial time series, DNA — have extensive sequential dependencies. The entropy rate extends Shannon entropy to stochastic processes, capturing the true information content per symbol after accounting for all temporal correlations.
For a stationary process {X_n}, the entropy rate is H' = lim_{n->inf} H(X_n | X_{n-1}, ..., X_1). Each additional conditioning variable can only reduce entropy (information never hurts), so the sequence H(X_1), H(X_2|X_1), H(X_3|X_2,X_1), ... is non-increasing and bounded below by 0. It must converge. The limit H' is the irreducible uncertainty per symbol — the part that cannot be predicted even with complete knowledge of the entire past.
For a stationary Markov chain with transition matrix P and stationary distribution pi, the entropy rate simplifies beautifully: H' = sum_i pi_i * H(row_i of P) = -sum_i pi_i sum_j P_{ij} log P_{ij}. Only the current state matters for predicting the next symbol, so all higher-order conditioning adds nothing. The entropy rate is a weighted average of the per-state transition entropies, weighted by how often each state is visited.
The entropy rate is the operational compression limit for the process. The source coding theorem for stationary ergodic sources (the Shannon-McMillan-Breiman theorem) states that -(1/n) log p(X_1, ..., X_n) converges to H' almost surely. This means lossless compression of long sequences from the process requires at least H' bits per symbol. A compressor that models the process dependencies (a Markov model, a language model, an LZ algorithm that captures repeated patterns) can approach H', while an i.i.d. compressor wastes bits by ignoring correlations. The better the model, the closer the compression rate to H'. This is why modern language-model-based compressors achieve compression rates approaching Shannon's estimate for English: they model the deep sequential structure that determines H'.
No topics depend on this one yet.