
A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Tokenization and Lexemes

Graduate · Depth 80 in the knowledge graph

10 topics build on this

332 prerequisites beneath it



Core Idea

Tokenization is the process of converting a source code string into a sequence of tokens. Each token pairs a lexeme, the smallest meaningful unit of program text, with a category such as keyword, identifier, operator, or literal. Regular expressions define the pattern for each token type, and the lexer matches the input against these patterns to group characters into lexemes and classify them.

Explainer

Source code as written by a programmer is just a stream of characters — letters, digits, spaces, punctuation. Before a compiler can understand the structure of a program, it needs to group those characters into meaningful chunks. This is what tokenization (also called lexical analysis or scanning) does: it reads the raw character stream and produces a sequence of tokens, each labeled with a type and carrying the original text. Given the input `if (x >= 42)`, a tokenizer produces something like: `[KEYWORD:"if", LPAREN:"(", IDENT:"x", GE:">=", INT_LIT:"42", RPAREN:")"]`.

The terminology can be confusing, so here is the precise distinction. A lexeme is the actual substring of source code that was matched, for instance `">="` or `"42"`. A token is the lexeme paired with its category: `GE` for the greater-than-or-equal operator, `INT_LIT` for an integer literal. Some token types have many possible lexemes (every variable name is an `IDENT`), while others have exactly one (`>=` is always `GE`). The tokenizer's job is to decide where each lexeme begins and ends, and which category it belongs to.
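The lexeme/token distinction can be made concrete with a small data structure. The sketch below is illustrative only: the `Token` class and its field names `kind` and `lexeme` are assumptions chosen for this example, not part of any particular compiler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token record: a category plus the matched source text."""
    kind: str    # the category, e.g. "IDENT" or "GE"
    lexeme: str  # the exact substring of source code that was matched


# Many different lexemes can share one token type...
x_token = Token("IDENT", "x")
total_token = Token("IDENT", "total")

# ...while a fixed operator type has exactly one possible lexeme.
ge_token = Token("GE", ">=")

print(x_token.kind == total_token.kind)      # same category
print(x_token.lexeme == total_token.lexeme)  # different source text
```

A parser usually dispatches on `kind` and only consults `lexeme` when the text itself matters, such as a variable name or a literal's value.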

Your prerequisite knowledge of regular expressions applies directly here. Each token type is defined by a regex pattern: identifiers might match `[a-zA-Z_][a-zA-Z0-9_]*`, integer literals match `[0-9]+`, and so on. The tokenizer tries all patterns at the current position in the input and picks the one that matches the longest prefix; this is the longest match rule (also called maximal munch). It is why `>=` becomes a single `GE` token rather than `>` followed by `=`. When two patterns match the same length (like `if` matching both the keyword pattern and the identifier pattern), a priority rule breaks the tie, typically favoring keywords over identifiers. These two rules, longest match and priority, are enough to make tokenization unambiguous for most programming languages.
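The two rules can be sketched in a few lines of Python using the standard `re` module. This is a minimal illustration under stated assumptions, not a production lexer: the token names, the keyword set, and the `GT` type are hypothetical choices for the example. Priority is encoded by list order, and a later pattern replaces the current best only when it matches strictly more characters.

```python
import re

# Hypothetical token specification. Order encodes priority: on a tie in
# match length, the earlier entry wins (so keywords beat identifiers).
TOKEN_SPEC = [
    ("KEYWORD", r"if|else|while"),
    ("IDENT",   r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("INT_LIT", r"[0-9]+"),
    ("GE",      r">="),
    ("GT",      r">"),
    ("LPAREN",  r"\("),
    ("RPAREN",  r"\)"),
]
COMPILED = [(kind, re.compile(pattern)) for kind, pattern in TOKEN_SPEC]


def next_token(text, pos):
    """Return (kind, lexeme) for the longest match at pos, or None."""
    best = None
    for kind, regex in COMPILED:
        m = regex.match(text, pos)
        # Strictly longer replaces the current best; equal length does not,
        # which is what gives earlier patterns priority on ties.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (kind, m.group())
    return best


def tokenize(text):
    """Split text into (kind, lexeme) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = next_token(text, pos)
        if best is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens


print(tokenize("if (x >= 42)"))
print(tokenize("iffy"))  # longest match: one IDENT, not KEYWORD "if" + "fy"
```

Trying every pattern at every position is simple but slow. Practical lexer generators typically compile all the patterns into a single automaton that applies the same two rules in one pass.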

Tokenization also handles the parts of source code that carry no semantic meaning: whitespace and comments are consumed but typically not emitted as tokens (though some compilers preserve them for formatting tools or documentation generators). This stripping is what makes subsequent phases simpler — the parser never has to worry about spaces between tokens or comments interrupting expressions. The output of tokenization is a clean, linear sequence of meaningful tokens that the parser can process using the grammar rules you will study next.
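The sketch below shows one way whitespace and comments might be consumed without being emitted. It is a simplified illustration: the `//` line-comment syntax, the tiny `TOKEN` pattern, and the `keep_trivia` flag are all assumptions made for this example. The flag stands in for tools such as formatters that want the discarded text preserved.

```python
import re

# Hypothetical "trivia" patterns: whitespace and C-style line comments.
TRIVIA = re.compile(r"\s+|//[^\n]*")
# A deliberately tiny token pattern, just enough for the demonstration.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[0-9]+|[-+*/=()]")


def tokenize(text, keep_trivia=False):
    """Return (label, lexeme) pairs; trivia is dropped unless requested."""
    out, pos = [], 0
    while pos < len(text):
        # Trivia is checked first so that "//" starts a comment
        # instead of being read as two division operators.
        m = TRIVIA.match(text, pos)
        if m:
            if keep_trivia:
                out.append(("TRIVIA", m.group()))
            pos = m.end()
            continue
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        out.append(("TOKEN", m.group()))
        pos = m.end()
    return out


source = "x = 1 // set x\ny = x + 2"
print([lexeme for _, lexeme in tokenize(source)])
```

With the default setting the parser sees only the eight meaningful lexemes. Passing `keep_trivia=True` keeps the spaces, the newline, and the comment in the output stream.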


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Introduction to Exponents → Order of Operations → Integer Order of Operations → Variable Expressions → The Distributive Property → Variables and Expressions Review → Introduction to Polynomials → Adding and Subtracting Polynomials → Multiplying Polynomials → Factorial → Permutations → Combinations → Counting Principles: Addition and Multiplication Rules → Introduction to Graph Theory → Propositional Logic Foundations → Logical Equivalences → 
Boolean Algebra → Boolean Type and Truth Values → Comparison Operators and Boolean Tests → Logical Operators and Boolean Algebra → Boolean Algebra and Fundamental Laws → Logic Gates Fundamentals → Implementing Boolean Functions with Gates → Karnaugh Map Simplification → Combinational Circuit Design → Flip-Flops and Latches → Finite State Machines (FSMs) → Regular Expressions (Formal Language Theory) → Tokenization and Lexemes

Longest path: 81 steps · 332 total prerequisite topics

Prerequisites (2)

Regular Expressions (Formal Language Theory) (hard) · String Basics (hard)

Leads To (1)

The Parsing Problem (hard)