A scanner specification lists the keyword 'if' before the general identifier pattern [a-zA-Z_][a-zA-Z0-9_]*. When the scanner processes the input 'iffy', which token does it produce?
ATwo tokens: keyword 'if' followed by identifier 'fy'
BOne token: identifier 'iffy', because the longest match rule takes precedence
CA lexical error, because 'iffy' partially matches both a keyword and an identifier
DOne token: keyword 'if', because keywords always have highest priority
The longest match rule governs first: the scanner keeps consuming characters as long as a valid transition exists, then returns to the last accepting state. Reading 'iffy', the scanner can continue transitioning after 'if' (the 'f' and 'y' characters are valid identifier continuations), so it does not stop at 'if'. Priority ordering only breaks ties when two patterns match strings of *equal* length — for example, the input 'if ' in source code, where the space terminates the identifier pattern exactly at 'if'. Option 0 is the most common misconception: keyword priority does not override length.
Question 2 Multiple Choice
Why does a scanner generator convert the combined NFA to a DFA before emitting scanner code, rather than simulating the NFA directly at runtime?
ANFAs cannot recognize the same languages as DFAs and would miss some tokens
BDFAs enable deterministic, O(1)-per-character processing: each state and input character maps to exactly one next state, enabling a simple table-driven scanner loop
CNFAs require exponentially more memory than DFAs and cannot be stored in a transition table
DDFAs are simpler to construct from regular expressions than NFAs using Thompson's construction
NFAs and DFAs recognize exactly the same class of languages (the regular languages), so option 0 is wrong. The advantage of DFAs is determinism: at every step, current_state and current_character uniquely determine next_state. This makes the scanner loop trivially simple and fast — one table lookup per character, with no branching over multiple possible next states. Simulating an NFA requires maintaining a *set* of active states and computing ε-closures, which is slower and harder to compile efficiently. DFA table-driven scanning runs in O(n) time with tiny constants.
Question 3 True / False
A scanner generator combines all token patterns into a single NFA (using alternation) before converting to a DFA, so that the resulting DFA can classify tokens from any of the specified patterns in a single left-to-right pass.
TTrue
FFalse
Answer: True
Each token's regex produces a separate NFA fragment via Thompson's construction. The generator creates a new start state with ε-transitions to all fragment start states, merging them into one combined NFA via alternation. Subset construction on this combined NFA produces a single DFA where each DFA state implicitly encodes which NFA states (and thus which token patterns) are still live. When the DFA reaches an accepting state, it knows which token was matched. This is why a scanner need not run separate automata for each token type — the single DFA handles all patterns simultaneously.
Question 4 True / False
Because scanner generators use regular expressions, a sufficiently complex regex can recognize inputs with balanced nested parentheses, eliminating the need for a separate parser phase.
TTrue
FFalse
Answer: False
Regular expressions define exactly the class of *regular* languages, which cannot count. The pumping lemma for regular languages proves that no finite automaton can recognize strings like { (ⁿ)ⁿ : n ≥ 1 } — balanced nested structures require a pushdown automaton (equivalent to a context-free grammar). This is the fundamental reason compilers have two distinct phases: the scanner handles the flat, non-counting structure of tokens (keywords, literals, identifiers), while the parser uses a context-free grammar to handle recursive structure like nested expressions and blocks.
Question 5 Short Answer
Describe the pipeline from a regular expression specification to executable scanner code. What happens at each stage and why?
Think about your answer, then reveal below.
Model answer: Stage 1: Parse each token regex into a Thompson-NFA fragment, where each regex operator (concatenation, alternation, Kleene star) maps to a small NFA fragment with ε-transitions. Stage 2: Combine all fragments into one NFA via alternation (ε-transitions from a shared start state). Stage 3: Convert to a DFA using subset construction — each DFA state corresponds to a set of simultaneously-active NFA states. Stage 4: Optionally minimize the DFA. Stage 5: Emit the DFA as a transition table plus a driver loop that reads one character at a time, follows table entries, and reports the longest match.
The NFA→DFA step is the heart of the pipeline: it trades an exponentially larger DFA state space (worst case) for deterministic character processing. The driver loop then becomes trivially simple — one table lookup per character with no backtracking logic. This separation of concerns (declarative regex specification → automatic automaton construction → efficient table-driven execution) is what makes scanner generators practical: the compiler writer specifies *what* tokens look like, and the generator handles *how* to recognize them.