Chomsky Normal Form (CNF) is a standardized form for CFGs in which every production is either A → BC (two variables) or A → a (one terminal). Every context-free language has a CFG in CNF. Converting a grammar to CNF involves eliminating ε-productions, unit productions (A → B), and long productions, then ensuring only binary or terminal rules remain. CNF simplifies proofs about CFGs and enables the CYK algorithm for O(n³) parsing of any CFG. Parse trees for CNF grammars are full binary trees.
Practice the four-step conversion (eliminate ε-rules, unit rules, useless symbols, then binarize) on a concrete grammar. Verify that the converted grammar generates the same language (minus ε if it was originally in the language).
You already know that a context-free grammar (CFG) is a set of production rules that generate strings by substituting variables with combinations of variables and terminals. You've also seen that the same language can be generated by many different grammars — some elegant, some messy. Chomsky Normal Form is a way of rewriting any CFG into a standardized shape where every rule follows one of exactly two patterns: a variable produces two variables (`A → BC`), or a variable produces a single terminal (`A → a`). That's it. No rules with three symbols on the right, no rules that produce the empty string, no rules where one variable simply renames another.
Why bother with such a restrictive format? Because uniformity enables algorithms. When every rule either splits into exactly two branches or produces a single character, the parse tree becomes a full binary tree — every internal node has exactly two children. This rigid structure means that parsing a string of length *n* requires exactly *n* - 1 branching steps and *n* terminal steps, no matter the grammar. The CYK algorithm exploits this by filling in an *n* × *n* table, checking every possible way to split every substring into two halves, and determining in O(n³) time whether the string belongs to the language. Without CNF, you'd need to handle rules of arbitrary length, making a clean dynamic-programming approach much harder to formulate.
The conversion process itself is mechanical but must be done in the right order. First, you eliminate ε-productions (rules like `A → ε`) by finding every variable that can derive the empty string and creating new rules that account for its absence. Second, you eliminate unit productions (rules like `A → B` that just rename one variable to another) by tracing chains of renames and replacing them with direct rules. Third, you remove useless symbols — variables that can never be reached from the start symbol or that can never produce a terminal string. Finally, you binarize any remaining long rules: a rule like `A → BCDE` becomes `A → BX₁`, `X₁ → CX₂`, `X₂ → DE`, using fresh variables to break it into a chain of binary splits. Each step may create new issues that a previous step would have handled, which is why the ordering matters — eliminating ε-productions first prevents them from reappearing when you collapse unit rules.
The key conceptual point is that CNF conversion never changes the language — the new grammar generates exactly the same set of strings as the original (with the minor caveat that the empty string ε is excluded from CNF grammars, since there's no way to derive it without an ε-production). This is a powerful demonstration of a recurring theme in theory of computation: different representations of the same object can have very different algorithmic properties, and finding the right normal form transforms hard problems into tractable ones.