
A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Decision Trees and Random Forests

Graduate · Depth 82 in the knowledge graph

23 topics build on this

398 prerequisites beneath it


Algorithm Design Basics, Probability Axioms → Decision Trees and Random Forests → Advanced Ensemble Methods

Core Idea

Decision trees recursively partition feature space using splitting criteria such as information gain. Random forests reduce overfitting by averaging the predictions of many trees, each trained on a random subset of the data and features. Because the trees are decorrelated, the averaged prediction has lower variance than a single tree's.

Explainer

A decision tree works like a flowchart of yes/no questions. At each internal node, the algorithm asks a question about one feature, such as "Is income > $50,000?" or "Is age ≤ 30?", and splits the data into two branches based on the answer. This splitting continues recursively until the leaves contain data points that are sufficiently homogeneous (mostly one class for classification, or similar values for regression). The result is a partition of the entire feature space into axis-aligned rectangular regions, each assigned a prediction.
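The flowchart picture can be written directly as nested conditionals. The sketch below is a hand-written two-level tree, not one learned from data; the thresholds echo the example questions above, and the leaf labels "A", "B", and "C" are made up for illustration.

```python
def predict(income, age):
    """Hand-written tree: each `if` is an internal node, each `return` a leaf.

    The thresholds mirror the example questions in the text; the leaf
    labels are hypothetical placeholders.
    """
    if income > 50_000:        # root node: question about the income feature
        if age <= 30:          # second-level node: question about the age feature
            return "A"         # region: income > 50,000 and age <= 30
        return "B"             # region: income > 50,000 and age > 30
    return "C"                 # region: income <= 50,000 (any age)
```

Each of the three leaves corresponds to one rectangular region of the (income, age) plane, and every input falls into exactly one of them.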

The critical question is: which feature and which threshold should each split use? The algorithm evaluates every candidate split (each feature paired with each threshold that separates the observed values) and selects the one with the highest information gain, meaning the greatest reduction in impurity from the parent node to the size-weighted average of its children. For classification, impurity is typically measured by entropy (from information theory) or the Gini index (the probability that two examples drawn at random, with replacement, from the node have different labels). A split that separates cats from dogs perfectly has zero impurity in both children; a split that leaves both children as mixed as the parent gains nothing. The algorithm greedily picks the best split at each node, building the tree top-down, so each split is locally optimal but the finished tree is not guaranteed to be the best possible one. Because you know probability axioms, you can see that both criteria measure where a node's class distribution sits between uniform (maximum uncertainty) and pure (zero uncertainty).
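The split search described above can be sketched for a single numeric feature as follows. This is a minimal illustration, not any library's implementation: the function names are hypothetical, and the choice to test thresholds at midpoints between adjacent distinct values is an assumption (a common one, but not the only option).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: chance that two draws with replacement differ in label."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_reduction(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

def best_threshold(values, labels, impurity=entropy):
    """Try a threshold between each pair of adjacent distinct feature values.

    Returns (threshold, gain); threshold is None if no split improves on
    the parent node.
    """
    best_gain, best_t = 0.0, None
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # assumed convention: midpoint between neighbours
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = impurity_reduction(labels, left, right, impurity)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```

A full tree builder would run this search over every feature at a node, split on the winner, and recurse into each child until a stopping rule is met.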

A single decision tree is interpretable and fast but has a serious problem: overfitting. A tree grown without a depth limit can memorize the training data, creating tiny leaf nodes that capture noise rather than real patterns. Pruning (removing branches that don't improve validation performance) helps, but a more powerful solution is to build many trees and combine them. A random forest creates hundreds or thousands of trees, each trained on a different bootstrap sample (a random sample, drawn with replacement, of the training data). At each split, only a random subset of features is considered, which decorrelates the trees so that they make different errors on different examples. The final prediction is the majority vote (classification) or average (regression) across all trees.
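The three ingredients named above (bootstrap sampling, a random feature subset per split, and majority voting) can be sketched in a few lines. To keep the example short, each "tree" here is a one-split stump chosen by weighted Gini impurity, which is a deliberate simplification: a real random forest grows much deeper trees. All function names and default parameter values are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    """Most common label in the list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, feature_ids):
    """Best single split (lowest weighted Gini) using only `feature_ids`."""
    best = None  # (score, feature, threshold, left_label, right_label)
    for f in feature_ids:
        vals = sorted(set(row[f] for row in X))
        for lo, hi in zip(vals, vals[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    if best is None:  # no feature had two distinct values: constant predictor
        m = majority(y)
        return (None, None, m, m)
    return best[1:]

def predict_stump(stump, row):
    """Route one row through the stump's single question."""
    f, t, left_label, right_label = stump
    if f is None:
        return left_label
    return left_label if row[f] <= t else right_label

def fit_forest(X, y, n_trees=25, max_features=1, seed=0):
    """Train `n_trees` stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample rows with replacement
        feats = rng.sample(range(d), max_features)    # random feature subset for the split
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Classification by majority vote across all stumps."""
    return majority([predict_stump(s, row) for s in forest])
```

For regression, the same structure applies with the leaf label replaced by the mean target value and the final vote replaced by an average.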

Why does averaging decorrelated trees work so well? Each individual tree has high variance: small changes in the training data produce very different trees. But when you average many high-variance, low-bias estimators whose errors are only weakly correlated, the variance decreases while the low bias is preserved. This is the statistical insight behind bagging, the ensemble technique that random forests extend; not every ensemble method works this way. Random forests are robust in practice: they handle mixed feature types and high-dimensional inputs with minimal tuning, many implementations also cope with missing data, and adding more trees does not in itself cause overfitting, although a forest can still overfit if its individual trees fit noise. The main tradeoff is interpretability: a single tree is transparent, but a forest of 500 trees is a black box, though feature importance scores (measuring how much each feature reduces impurity across all trees) partially recover interpretability.
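The variance argument can be made precise under a simplifying assumption that the text does not state: suppose the predictions of the B trees at a fixed input, written T_1, ..., T_B, are identically distributed with variance σ² and share a common pairwise correlation ρ. These symbols are introduced here for illustration. Expanding the variance of the sum into B variance terms and B(B − 1) covariance terms gives:

```latex
\[
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
  = \frac{1}{B^2}\left(B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .
\]
```

The second term shrinks toward zero as B grows, while the first term, ρσ², is a floor that more trees cannot remove. Lowering ρ is the only way to lower that floor, which is what considering a random subset of features at each split is meant to achieve. With ρ = 0 the expression reduces to the familiar σ²/B.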


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Introduction to Exponents → Order of Operations → Integer Order of Operations → Variable Expressions → The Distributive Property → Variables and Expressions Review → Introduction to Polynomials → Adding and Subtracting Polynomials → Multiplying Polynomials → Factorial → Permutations → Combinations → Counting Principles: Addition and Multiplication Rules → Introduction to Graph Theory → Propositional Logic Foundations → Logical Equivalences → 
Boolean Algebra → Boolean Type and Truth Values → Comparison Operators and Boolean Tests → Logical Operators and Boolean Algebra → Conditional Statements → Defining and Calling Functions → Functions: Decomposing Problems → Function Parameters and Argument Passing → Return Values → Variable Scope → Introduction to Classes → Objects and Instances → Methods and Attributes → Algorithm Design Basics → Decision Trees and Random Forests

Longest path: 83 steps · 398 total prerequisite topics

Prerequisites (2)

Algorithm Design Basics (soft) · Probability Axioms (soft)

Leads To (1)

Advanced Ensemble Methods (hard)