← Graph View All Domains

A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Source Coding Theorem

Graduate Depth 91 in the knowledge graph ☐ I know this ☆ Set as goal

21topics build on this

413prerequisites beneath it

See this on the map →

Probability Density Functions Shannon Entropy→→Arithmetic Coding Data Compression Basics +6 more

Core Idea

Shannon's source coding theorem (noiseless coding theorem) states that the entropy H(X) of a source is the fundamental limit of lossless compression. No lossless code can achieve an average rate below H(X) bits per symbol, and there exist codes that get arbitrarily close to H(X). For a sequence of n i.i.d. symbols, the compressed length approaches nH(X) bits as n grows. This theorem established entropy as the operationally meaningful measure of information content and launched the field of information theory.

Explainer

Shannon entropy tells you the average uncertainty per symbol. The source coding theorem gives this number operational teeth: entropy is the compression limit. If a source has entropy H bits per symbol, you need at least H bits per symbol on average to represent the output losslessly, and you can get arbitrarily close to H with a sufficiently clever code.

The achievability proof relies on the concept of typical sequences. For a long sequence of n i.i.d. symbols from a source with entropy H, the law of large numbers implies that the empirical frequency of each symbol concentrates around its true probability. The "typical set" — sequences whose empirical statistics match the source distribution — contains approximately 2^nH sequences, even though the total number of possible sequences is |alphabet|^n = 2^{n log |alphabet|}. Since 2^nH << 2^{n log |alphabet|} when H < log|alphabet|, we only need nH bits to index the typical sequences. Atypical sequences have negligible total probability and can be handled with a small overhead.

The converse — that you cannot beat H — follows from the non-negativity of KL divergence. Any lossless code induces a probability distribution over codewords, and the expected codeword length is at least the entropy of the source. Attempting to assign shorter codewords to all symbols violates the Kraft inequality (the constraint that ensures unique decodability). Shorter codes for some symbols necessarily mean longer codes for others, and the probability-weighted average cannot go below H.

The practical impact is enormous. Before Shannon, engineers designed compression schemes by intuition and heuristics. After Shannon, there was a target: entropy. Huffman coding (1952) achieves within 1 bit of entropy per symbol. Arithmetic coding achieves within a fraction of a bit. Modern compressors (gzip, zstd, lossless PNG) all operate in the shadow of this theorem, exploiting statistical redundancy to approach the entropy rate of their input. The theorem also explains why some data cannot be compressed: truly random data has maximum entropy, and no algorithm can shrink it.

Practice Questions 3 questions