Data compression reduces the number of bits needed to represent information by exploiting redundancy. Lossless compression (Huffman, arithmetic, LZ77, LZ78) allows perfect reconstruction and is bounded by the source entropy. Lossy compression (JPEG, MP3, H.264) sacrifices some fidelity for much higher compression ratios, governed by rate-distortion theory. Practical compressors combine a model (which identifies redundancy and predicts symbols) with a coder (which converts predictions into a compact bitstream). The model captures statistical structure; the coder approaches the entropy of the model's predictions.
Data compression is the practical application of Shannon's source coding theorem. The theorem tells you the limit (entropy); compression algorithms try to approach it. Understanding compression requires separating two concerns: the model (what statistical structure does the data have?) and the coder (how do we turn predicted statistics into bits?).
Lossless compression guarantees perfect reconstruction. Entropy coding (Huffman, arithmetic) is the coder; it converts symbol probabilities into near-optimal bit assignments. But the entropy of raw bytes is high. The model's job is to reduce the effective entropy by capturing redundancy. LZ77 (used in gzip, DEFLATE) captures repeated substring patterns by replacing them with backward references. LZ78/LZW (used in classic GIF) builds a dictionary of seen patterns. PPM (prediction by partial matching) uses character-level context to predict the next symbol. Burrows-Wheeler Transform (used in bzip2) rearranges data to group similar characters together, making the result more compressible by subsequent entropy coding.
Lossy compression allows controlled degradation of the reconstructed output. JPEG transforms image blocks into frequency coefficients (via DCT), quantizes them (the lossy step — small coefficients are zeroed), and entropy-codes the result. MP3 uses a psychoacoustic model to identify sounds the human ear cannot perceive, then removes them. H.264/H.265 video codecs combine motion prediction (modeling temporal redundancy) with spatial prediction and quantization. In each case, the lossy step reduces the effective entropy of what remains, allowing the entropy coder to produce a much smaller output than any lossless method could.
The information-theoretic perspective unifies these: every compressor is trying to approach the entropy of the source (lossless) or the rate-distortion function (lossy). The gap between a compressor's output size and the theoretical limit reveals inefficiency — either the model is not capturing all the redundancy, or the coder is not converting predictions to bits efficiently. Modern machine-learned compressors (neural codecs for images, audio, and video) demonstrate that better models consistently yield better compression, validating the information-theoretic framework. Shannon's theorems do not just provide limits; they provide a blueprint.