A compiler is organized into distinct phases: lexical analysis, syntax analysis, semantic analysis, intermediate code generation, optimization, and code generation. Each phase transforms the program into a successively lower-level representation. Understanding overall organization is essential for implementing any specific phase.
Study classic multi-pass compiler models used in real compilers (gcc, clang, javac). Trace a simple program through each phase and identify which transformations occur.
All phases must be completely separate passes (many compilers interleave them). Lexical and syntax analysis are the hard parts (semantic analysis and optimization are often harder).
A compiler appears from the outside to be a single program that takes source code and produces an executable, but internally it is a pipeline of distinct transformations, each with a well-defined input and output representation. Understanding this organization tells you where different types of errors are caught, why certain language features are easy or hard to implement, and how to reason about performance.
The pipeline begins with *lexical analysis* (scanning), which reads the raw character stream and groups characters into tokens — the smallest meaningful units like keywords, identifiers, literals, and operators. The scanner does not understand structure; it only recognizes patterns. *Syntax analysis* (parsing) takes the token stream and checks whether it conforms to the language grammar, building a parse tree or AST in the process. Errors like mismatched parentheses or malformed expressions are caught here. Both scanning and parsing are largely mechanical — they are specified by formal grammars and regular expressions, and tools like Flex and Bison (or ANTLR) generate them automatically.
*Semantic analysis* is where deeper checking happens: type checking, name resolution, scope analysis, and enforcement of language-specific rules that cannot be expressed in a context-free grammar (like "a variable must be declared before use" or "break can only appear inside a loop"). The semantic analyzer builds and queries a *symbol table* — a data structure tracking every declared name, its type, and its scope. Many programmers find that this phase is more intellectually demanding than parsing because it requires reasoning about meaning, not just structure.
After semantic analysis, the compiler translates the AST into an *intermediate representation* (IR) — a simplified, architecture-neutral code form that is easier to optimize than either source code or machine code. The *optimization* phase then applies a series of passes to the IR: constant folding, dead code elimination, loop unrolling, function inlining, and more. These passes run in sequence and can be added or removed independently. Finally, *code generation* maps the optimized IR to machine instructions for the target architecture, handling instruction selection, register allocation, and instruction scheduling.
One common misconception is that these phases must be completely separate passes over the entire program. In practice, many compilers interleave them. A simple recursive-descent parser often interleaves parsing and semantic analysis; some production compilers generate IR instruction by instruction as they parse each construct. The phases are conceptually distinct but their implementation can overlap for efficiency. What matters is that the *concerns* remain separated — lexical rules are specified independently of grammar rules, type rules are independent of code generation strategies — because this separation is what makes compilers maintainable and extensible.