A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Bayesian Optimization

Graduate · Depth 102 in the knowledge graph

779 prerequisites beneath it

Core Idea

Bayesian optimization searches hyperparameter spaces efficiently by fitting a probabilistic surrogate model, typically a Gaussian process, to the objective and using an acquisition function to choose each new evaluation point. It balances exploration (trying regions where the surrogate is uncertain) against exploitation (refining regions already known to be good). This can substantially reduce the number of function evaluations needed compared with grid or random search.

Explainer

From your work with hyperparameter optimization, you know the basic problem: training a model with a given set of hyperparameters is expensive (minutes to hours per evaluation), and the search space can be large (learning rate, regularization strength, architecture choices, etc.). Grid search is exhaustive but wasteful; random search is better but still blind to the results of previous trials. Bayesian optimization is the principled alternative — it uses every past evaluation to decide where to look next.

The method has two components. First, a surrogate model, typically a Gaussian process (GP), approximates the unknown objective function (e.g., validation accuracy as a function of hyperparameters). After the objective has been evaluated at a few initial points, the GP is fit to those observations and provides not just a predicted value at any untried point, but also an uncertainty estimate. Where you have evaluated, the GP is confident and its predictions hug the observed values. Where you haven't evaluated, the GP is uncertain and its confidence bands widen. This uncertainty map is the key ingredient that grid and random search lack entirely.
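To make the surrogate's behavior concrete, here is a minimal sketch of GP regression in one dimension. The squared-exponential kernel, the unit length scale, the zero prior mean, and the helper names `rbf_kernel` and `gp_posterior` are all assumptions chosen for illustration; real libraries also fit the kernel hyperparameters.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, length_scale=1.0, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    # Kernel matrix of the observations, with a small noise term for stability.
    K = rbf_kernel(x_obs, x_obs, length_scale) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_new, length_scale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    # The prior variance k(x, x) is 1 for this kernel.
    var = 1.0 - np.sum(v ** 2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Three hypothetical observations of an objective (here simply sin).
x_obs = np.array([-2.0, 0.0, 1.5])
y_obs = np.sin(x_obs)

# Query one point that was observed (0.0) and one far from all data (4.0).
x_new = np.array([0.0, 4.0])
mean, std = gp_posterior(x_obs, y_obs, x_new)
print("mean:", mean)
print("std: ", std)
```

At the observed point the standard deviation is close to zero and the mean matches the observation; at the distant point the standard deviation returns almost to its prior value of 1, which is the widening confidence band described above.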

Second, an acquisition function translates the GP's predictions and uncertainties into a score for each candidate point, answering "where should I evaluate next?" A widely used acquisition function is Expected Improvement (EI): given the best result observed so far, EI computes the expected amount by which a new point would improve upon it, integrating over the GP's uncertainty. Points where the GP predicts high performance score well (exploitation), but so do points where the GP is very uncertain, because they might harbor unexpectedly good results (exploration). This exploration-exploitation tradeoff is handled automatically: EI favors uncertain regions when exploitation opportunities are exhausted and focuses on promising regions when they emerge.
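For a Gaussian prediction, EI has a well-known closed form. The sketch below assumes a maximization problem (such as validation accuracy) and a single candidate with predicted mean `mean` and standard deviation `std`; the function name, the optional margin `xi`, and the three candidate values are illustrative assumptions, not taken from any particular library.

```python
import math

def expected_improvement(mean, std, best, xi=0.0):
    """Closed-form EI for maximization under a Gaussian prediction N(mean, std^2).

    EI = (mean - best - xi) * Phi(z) + std * phi(z), with z = (mean - best - xi) / std,
    where Phi and phi are the standard normal CDF and PDF.
    """
    if std <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(mean - best - xi, 0.0)
    z = (mean - best - xi) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best - xi) * cdf + std * pdf

best_so_far = 0.90  # hypothetical best validation accuracy observed so far

ei_exploit = expected_improvement(mean=0.92, std=0.01, best=best_so_far)  # high mean, low uncertainty
ei_explore = expected_improvement(mean=0.85, std=0.10, best=best_so_far)  # lower mean, high uncertainty
ei_neither = expected_improvement(mean=0.85, std=0.01, best=best_so_far)  # lower mean, low uncertainty

print(ei_exploit, ei_explore, ei_neither)
```

The first two candidates receive comparable scores for different reasons, while the third, which is neither promising nor uncertain, scores close to zero.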

The optimization loop is straightforward: (1) fit the GP to all observations so far, (2) maximize the acquisition function to select the next point to evaluate, (3) evaluate the true objective at that point, (4) add the result to the observation set, and repeat. Because maximizing the acquisition function is cheap (it is a closed-form function of the GP's predictions, not a full model training run), the computational cost is dominated by the actual objective evaluations. In practice, Bayesian optimization can often find near-optimal hyperparameters in roughly 20–50 evaluations where random search might need hundreds, although the exact savings depend on the problem. This makes it particularly valuable when each evaluation involves training a large model.
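The four steps above can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the one-dimensional `objective` stands in for an expensive training run, the GP uses a fixed RBF kernel with unit length scale, observations are standardized before fitting, and the acquisition function is maximized by brute force over a grid of candidates rather than with a proper optimizer.

```python
import math
import numpy as np

def objective(x):
    """Toy stand-in for an expensive training run (true maximum at x = 2)."""
    return -(x - 2.0) ** 2

def rbf(a, b, length_scale=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

def gp_predict(x_obs, y_obs, x_new, jitter=1e-4):
    """Posterior mean and std of a zero-mean, unit-variance GP with an RBF kernel."""
    L = np.linalg.cholesky(rbf(x_obs, x_obs) + jitter * np.eye(len(x_obs)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    k_s = rbf(x_obs, x_new)
    v = np.linalg.solve(L, k_s)
    var = np.maximum(1.0 - np.sum(v ** 2, axis=0), 1e-12)
    return k_s.T @ alpha, np.sqrt(var)

def expected_improvement(mean, std, best):
    """Closed-form EI for maximization, evaluated on arrays of candidates."""
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf

def bayes_opt(n_iter=10):
    xs = [0.0, 2.5, 5.0]                      # initial design
    ys = [objective(x) for x in xs]
    candidates = np.linspace(0.0, 5.0, 501)   # grid over the search interval
    for _ in range(n_iter):
        x_obs, y_obs = np.array(xs), np.array(ys)
        # Standardize targets so the zero-mean, unit-variance prior is reasonable.
        y_scaled = (y_obs - y_obs.mean()) / (y_obs.std() + 1e-12)
        mean, std = gp_predict(x_obs, y_scaled, candidates)   # step 1: fit the GP
        ei = expected_improvement(mean, std, y_scaled.max())
        x_next = float(candidates[np.argmax(ei)])             # step 2: maximize EI
        y_next = objective(x_next)                            # step 3: evaluate
        xs.append(x_next)                                     # step 4: record
        ys.append(y_next)
    best = int(np.argmax(ys))
    return xs[best], ys[best], len(xs)

best_x, best_y, n_evals = bayes_opt()
print(f"best x = {best_x:.3f}, best y = {best_y:.4f}, evaluations = {n_evals}")
```

Note that only `objective` would be expensive in a real setting; the GP fit and the grid search over `candidates` cost almost nothing by comparison. The grid search only works in very low dimensions, so practical implementations typically use a numerical optimizer with restarts for step 2.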

The approach does have limitations. Exact Gaussian process inference scales cubically in time with the number of observations, so it becomes unwieldy beyond a few thousand evaluations, though this rarely matters since the whole point is to minimize evaluations. High-dimensional search spaces (more than about 20 hyperparameters) challenge GPs because the surrogate model becomes too uncertain to guide search effectively. For these settings, variants like the Tree-structured Parzen Estimator (TPE) used in Optuna, or random forest-based surrogates used in SMAC, provide scalable alternatives that keep the Bayesian principle of learning from past evaluations without requiring a full GP.

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Introduction to Exponents → Order of Operations → Integer Order of Operations → Variable Expressions → The Distributive Property → Variables and Expressions Review → Introduction to Polynomials → Adding and Subtracting Polynomials → Multiplying Polynomials → Factorial → Permutations → Combinations → Counting Principles: Addition and Multiplication Rules → Introduction to Graph Theory → Propositional Logic Foundations → Logical Equivalences → 
Boolean Algebra → Boolean Type and Truth Values → Comparison Operators and Boolean Tests → Logical Operators and Boolean Algebra → Conditional Statements → Defining and Calling Functions → Functions: Decomposing Problems → Function Parameters and Argument Passing → Return Values → Variable Scope → Introduction to Classes → Objects and Instances → Methods and Attributes → Algorithm Design Basics → Tree Structure and Node Properties → Binary Trees → Tree Traversals → Depth-First Search (DFS) → Depth-First Search: Implementation and Applications → Topological Sort → Dynamic Programming → Longest Common Subsequence (LCS) Problem → Edit Distance: Levenshtein Distance and DP → 0/1 Knapsack Problem: Bounded Capacity DP → Greedy Algorithms → Activity Selection Problem Using Greedy Algorithms → Dijkstra's Algorithm → A* Search Algorithm → Heuristic Search Functions → Local Search Optimization → Genetic Algorithms → Stochastic Gradient Descent and Variants → Optimization Algorithms: SGD, Adam, RMSprop → Hyperparameter Optimization → Bayesian Optimization

Longest path: 103 steps · 779 total prerequisite topics

Prerequisites (3)

Hyperparameter Optimization (hard) · Bayes' Theorem and Statistical Inference (soft) · Expected Value and Variance (soft)

Leads To (0)

No topics depend on this one yet.