Huffman Coding: Optimal Prefix Codes via Greedy

College Depth 68 in the knowledge graph I know this Set as goal
Unlocks 23 downstream topics
greedy coding compression

Core Idea

Huffman coding constructs an optimal prefix-free code for a given frequency distribution. Repeatedly merge the two least-frequent nodes into a new parent. The resulting tree encodes frequent symbols with shorter code lengths, minimizing expected code length. Proof via exchange argument shows optimality.

How It's Best Learned

Implement Huffman coding using a min-heap. Build the tree, extract codes, and measure compression on real text. Compare code lengths to fixed-length and other variable-length schemes.

Common Misconceptions

Explainer

You already know that a greedy algorithm builds a solution piece by piece, always choosing the locally optimal next step. Huffman coding applies this strategy to a specific problem: given a set of symbols with known frequencies, assign binary codes so that the total number of bits used is minimized. The key constraint is that the code must be prefix-free — no codeword is a prefix of another — so the decoder can read a stream of bits and unambiguously determine where each symbol ends without needing delimiters.

The algorithm works bottom-up. Start with each symbol as a leaf node, weighted by its frequency. Repeatedly extract the two nodes with the smallest frequencies — this is where your knowledge of heaps pays off, since a min-heap makes this extraction O(log n) — and merge them into a new internal node whose frequency is their sum. This new node goes back into the heap. Continue until only one node remains: the root of the Huffman tree. Every left branch gets a 0 and every right branch gets a 1, and the code for each symbol is the sequence of bits on the path from root to its leaf.

Why does this produce optimal codes? The intuition is that the two least-frequent symbols should be the deepest in the tree (longest codes), because they contribute the least to the total bit count. The exchange argument formalizes this: if the two least-frequent symbols were not siblings at the maximum depth, you could swap them into that position and reduce or maintain the total cost, contradicting the assumption that the original tree was optimal. By induction, the greedy merging strategy yields the minimum expected code length for any prefix-free code.

Consider a concrete example: if you have symbols A (50%), B (25%), C (15%), D (10%), the algorithm first merges C and D (combined 25%), then merges that node with B (combined 50%), then merges with A. The result: A gets a 1-bit code, B gets a 2-bit code, and C and D get 3-bit codes. Compare this to a fixed 2-bit code for all four symbols — Huffman's variable-length scheme uses fewer bits overall because it assigns shorter codes to more frequent symbols. This is the core insight: frequency determines depth, and the greedy merge ensures the mapping is optimal.

Practice Questions 5 questions

Prerequisite Chain

Counting to 10Counting to 20Understanding ZeroThe Number ZeroCounting to FiveOne-to-One CorrespondenceCombining Small Groups Within 5Addition Within 10Addition Within 20Two-Digit Addition Without RegroupingTwo-Digit Addition with RegroupingAddition Within 100Repeated Addition as MultiplicationMultiplication Facts Within 100Division as Equal SharingDivision as Grouping (Measurement Division)Division: Grouping (Repeated Subtraction) ModelDivision: Fair Sharing ModelDivision as Equal SharingDivision as GroupingBasic Division FactsDivision Facts Within 100Two-Digit by One-Digit DivisionDivision with RemaindersRemainders and Quotients in DivisionDivision Word ProblemsIntroduction to Long DivisionFactors and MultiplesPrime and Composite NumbersEquivalent FractionsRelating Fractions and DecimalsDecimal Place ValueReading and Writing DecimalsComparing and Ordering DecimalsAdding and Subtracting DecimalsMultiplying DecimalsDividing DecimalsDividing FractionsMixed Number ArithmeticOrder of OperationsInteger Order of OperationsVariable ExpressionsCombining Like TermsOne-Step EquationsTwo-Step EquationsSolving Multi-Step EquationsEquations with Variables on Both SidesLiteral EquationsSlope-Intercept FormPoint-Slope FormWriting Linear EquationsParallel and Perpendicular Line SlopesGraphing Linear EquationsSystems of Equations — Graphing MethodSystems of Equations — Elimination MethodSystems of Three VariablesMatrices IntroductionGraph Representation: Matrices and ListsGraph Paths, Cycles, and ConnectivityTrees and Spanning TreesBinary TreesTree TraversalsBreadth-First Search (BFS)Topological SortDynamic ProgrammingEdit Distance: Levenshtein Distance and DP0/1 Knapsack Problem: Bounded Capacity DPGreedy AlgorithmsHuffman Coding: Optimal Prefix Codes via Greedy

Longest path: 69 steps · 356 total prerequisite topics

Prerequisites (3)

Leads To (1)