A tokenizer encounters the input `<=`. It has defined patterns: `<` maps to LESS_THAN, and `<=` maps to LESS_EQUAL. Which token does it produce, and why?
A. Two tokens: LESS_THAN for `<` followed by EQUAL for `=`
B. One token: LESS_EQUAL for `<=`, because the longest match rule selects the pattern that matches the most characters
C. An error: `<=` is ambiguous because both patterns could apply
D. One token: LESS_THAN, because simpler patterns take priority
The longest match rule (maximal munch) resolves this: when multiple patterns match at the current position, the tokenizer picks the one that consumes the most characters. `<=` matches two characters, while `<` alone matches only one. So `<=` wins and produces LESS_EQUAL. Without this rule, operators like `<=`, `>=`, `==`, and `!=` would be incorrectly broken into single-character tokens.
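The longest match rule can be sketched in a few lines of Python. This is a minimal illustration, not how any particular compiler implements it; the `OPERATORS` table and the `longest_operator` helper are hypothetical names introduced here.

```python
# Hypothetical operator table: lexeme -> token type (names are illustrative).
OPERATORS = {
    "<": "LESS_THAN",
    "<=": "LESS_EQUAL",
    ">": "GREATER_THAN",
    ">=": "GREATER_EQUAL",
    "=": "EQUAL",
    "==": "EQUAL_EQUAL",
    "!=": "NOT_EQUAL",
}


def longest_operator(source, pos):
    """Return (token_type, lexeme) for the longest operator at pos, or None."""
    best = None
    for lexeme, token_type in OPERATORS.items():
        # Every pattern that matches at pos is a candidate...
        if source.startswith(lexeme, pos):
            # ...but only the one consuming the most characters is kept.
            if best is None or len(lexeme) > len(best[1]):
                best = (token_type, lexeme)
    return best


print(longest_operator("<=", 0))   # ('LESS_EQUAL', '<=')
print(longest_operator("< =", 0))  # ('LESS_THAN', '<')
```

Both `<` and `<=` match at position 0 of `<=`, and the two-character candidate wins. In `< =` only `<` matches, so the result is LESS_THAN.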
Question 2 Multiple Choice
In the tokenized output for the source text `if (count >= 10)`, what is the *lexeme* for the `>=` operator?
A. GE (the token type name)
B. `>=` (the actual two-character substring from source code)
C. 2 (the character count of the match)
D. Both `>=` and GE together: a lexeme is always a type-value pair
A lexeme is the raw substring of source text that was matched — in this case, the two characters `>=`. A token is the lexeme *paired with* its category label (GE or GREATER_EQUAL). The distinction matters: many different lexemes can produce the same token type (every variable name like `count`, `total`, `i` produces an IDENTIFIER token), but the lexeme preserves what was actually written so error messages and debugging can refer back to the source.
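As a rough sketch of the pairing, a token can be modelled as a small record holding a type and a lexeme. The `Token` class and `identifier_tokens` helper below are assumptions made for illustration, using the identifier pattern `[a-zA-Z_][a-zA-Z0-9_]*`.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    type: str    # category label, e.g. IDENTIFIER
    lexeme: str  # raw substring copied from the source text


IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def identifier_tokens(source):
    """Wrap every identifier-shaped substring in an IDENTIFIER token."""
    return [Token("IDENTIFIER", m.group()) for m in IDENTIFIER.finditer(source)]


tokens = identifier_tokens("count total i")
print([t.lexeme for t in tokens])  # ['count', 'total', 'i']
print({t.type for t in tokens})    # {'IDENTIFIER'}
```

Three different lexemes collapse to one token type, while each token still carries the text that was actually written.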
Question 3 True / False
The keyword `while` in most programming languages would be tokenized as an IDENTIFIER, because it matches the identifier pattern `[a-zA-Z_][a-zA-Z0-9_]*`.
T. True
F. False
Answer: False
When `while` matches both the keyword pattern and the identifier pattern at the same length, the *priority rule* breaks the tie by giving keywords higher priority than identifiers. The resulting token is KEYWORD or WHILE, not IDENTIFIER. This priority rule is essential: without it, every language keyword would be treated as a user-defined variable name, and the parser could not tell reserved words from identifiers.
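One common way to get the same effect as the priority rule is to match the identifier pattern first and then look the lexeme up in a reserved-word table. The sketch below uses that technique rather than priority-ordered patterns; the `KEYWORDS` set and `classify_word` helper are hypothetical.

```python
import re

KEYWORDS = {"while", "if", "else", "return"}  # hypothetical reserved words
IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def classify_word(source, pos=0):
    """Match an identifier-shaped lexeme, then let keywords take priority."""
    m = IDENTIFIER.match(source, pos)
    if m is None:
        return None
    lexeme = m.group()
    # Tie-break: same lexeme, same length, so the keyword label wins.
    token_type = "KEYWORD" if lexeme in KEYWORDS else "IDENTIFIER"
    return (token_type, lexeme)


print(classify_word("while"))        # ('KEYWORD', 'while')
print(classify_word("while_count"))  # ('IDENTIFIER', 'while_count')
```

The second call shows that priority only breaks ties: `while_count` is a longer match than `while`, so longest match applies first and the result is an IDENTIFIER.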
Question 4 True / False
In most compilers, whitespace and comments between tokens are consumed by the tokenizer but not emitted as tokens in the output sequence.
T. True
F. False
Answer: True
This is a fundamental design choice in tokenization: in most languages, whitespace (spaces, tabs, newlines) and comments carry no meaning for the parser beyond separating tokens, so they are consumed and discarded rather than emitted. The result is a clean, linear token stream where every element is semantically meaningful. This simplifies every subsequent compiler phase, because the parser never has to check 'is this a space or a real token?' Some tools (formatters, documentation generators) preserve whitespace and comments because their output depends on them, but that is a tooling-specific choice rather than the default for compilers.
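A minimal sketch of consume-and-discard, assuming `//` line comments and a tiny hypothetical token set (`TOKEN_SPEC`, `MASTER`, and `tokenize` are illustrative names, and keyword handling is left out):

```python
import re

# Hypothetical token set. Python's regex alternation takes the first
# alternative that matches, so `>=` is listed before `>`.
TOKEN_SPEC = [
    ("SKIP", r"[ \t\r\n]+|//[^\n]*"),  # whitespace and line comments
    ("NUMBER", r"\d+"),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("GREATER_EQUAL", r">="),
    ("GREATER_THAN", r">"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return (type, lexeme) pairs, dropping whitespace and comments."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":  # consumed, but never emitted
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


for token in tokenize("(count >= 10) // loop guard"):
    print(token)
```

The spaces and the trailing comment advance `pos` like any other match, but nothing is appended for them, so the parser only ever sees the six meaningful tokens.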
Question 5 Short Answer
Explain the difference between a lexeme and a token, and give a concrete example showing why the distinction matters.
Model answer: A lexeme is the actual substring of source code that was matched by the tokenizer, that is, the raw text. A token is the lexeme paired with its category label (token type). For example, the identifiers `count`, `totalItems`, and `x` are three different lexemes, but all produce tokens of type IDENTIFIER. By contrast, `+` and `-` are two different lexemes that produce tokens of different types (PLUS and MINUS). The distinction matters because the parser works with token types (not raw text) to recognize grammatical structure, while error messages and source maps need the original lexeme text to refer back to what was actually written in the source code. A lexeme without a type gives the parser nothing to match grammar rules against; a type without a lexeme loses the text that was actually written.