A model has 6 hyperparameters, but only learning rate and batch size meaningfully affect performance. A researcher runs 50 evaluations. Compared to grid search over all 6 parameters, random search would most likely:
A. Perform worse because it does not evaluate every combination systematically
B. Perform comparably because both methods sample the same number of configurations
C. Find better learning rate and batch size values because it explores more distinct values of those important dimensions per evaluation
D. Only outperform grid search for deep learning models, not other model types
Answer: C
When only a subset of hyperparameters truly matters, grid search spends most of its budget varying the unimportant ones. With 6 hyperparameters at 5 values each, a full grid has 5^6 = 15,625 points, far beyond a 50-run budget; even 2 values per hyperparameter already needs 2^6 = 64 runs, and each important hyperparameter is only ever tried at those few grid values. Random search draws a fresh value of every hyperparameter on each run, so 50 evaluations can cover up to 50 distinct values of learning rate and of batch size across their full ranges. Bergstra and Bengio (2012) demonstrated this empirically: random search often finds good configurations faster than grid search with the same budget.
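The counting argument can be checked directly. The sketch below is only an illustration, assuming all six hyperparameters are rescaled to the unit interval and that positions 0 and 1 stand for learning rate and batch size; the grid values 0.25 and 0.75 and the names `grid`, `random_configs`, and `distinct_values` are placeholders, not part of the quiz.

```python
import itertools
import random

N_PARAMS = 6   # position 0 = learning rate, position 1 = batch size (assumed)
BUDGET = 50

# Grid search: 2 values per hyperparameter is the smallest non-trivial grid,
# and it already costs 2**6 = 64 runs.
grid_values = [0.25, 0.75]  # arbitrary placeholder values
grid = list(itertools.product(grid_values, repeat=N_PARAMS))

# Random search: every hyperparameter is drawn independently on each run.
rng = random.Random(0)
random_configs = [
    tuple(rng.random() for _ in range(N_PARAMS)) for _ in range(BUDGET)
]


def distinct_values(configs, index):
    """Count how many different values one hyperparameter takes."""
    return len({config[index] for config in configs})


print("grid:  ", len(grid), "runs,",
      distinct_values(grid, 0), "learning-rate values,",
      distinct_values(grid, 1), "batch-size values")
print("random:", len(random_configs), "runs,",
      distinct_values(random_configs, 0), "learning-rate values,",
      distinct_values(random_configs, 1), "batch-size values")
```

The grid spends 64 runs to try only 2 values of each important hyperparameter, while random search tries a new value of both on every one of its 50 runs.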
Question 2 Multiple Choice
What distinguishes Bayesian optimization from both grid and random search in how it selects configurations to evaluate?
A. It evaluates every combination in the hyperparameter space exhaustively before reporting results
B. It samples configurations randomly but then applies a filter to remove obviously bad ones
C. It builds a probabilistic surrogate model of the performance landscape and uses an acquisition function to direct evaluations toward promising regions
D. It fixes the least important hyperparameters first and then exhaustively searches the remaining ones
Answer: C
Bayesian optimization uses a surrogate model (typically a Gaussian process) that represents current beliefs about how hyperparameters map to validation performance. After each evaluation, the model updates its beliefs, and an acquisition function (such as expected improvement) selects the next most informative configuration — balancing exploitation (near known good regions) and exploration (uncertain regions). This directed search is fundamentally different from the blind sampling of grid and random search.
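The loop described above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not a production implementation: one hyperparameter rescaled to [0, 1], a zero-mean Gaussian process with a fixed squared-exponential kernel (length scale 0.2 chosen arbitrarily), expected improvement as the acquisition function, and a hypothetical `toy_score` function standing in for a real training run. All function names are invented for the example.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel: nearby settings get correlated scores."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def posterior(xs, ys, x_new, jitter=1e-6):
    """Surrogate belief (mean, std) about the score at x_new."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    weights = solve(K, k_star)  # K^{-1} k_star
    mean = sum(w * y for w, y in zip(weights, ys))
    var = rbf(x_new, x_new) - sum(w * k for w, k in zip(weights, k_star))
    return mean, math.sqrt(max(var, 1e-12))


def expected_improvement(mean, std, best):
    """Expected amount by which a candidate beats the best score so far."""
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mean - best) * cdf + std * pdf


def toy_score(x):
    """Hypothetical validation score; stands in for a full training run."""
    return 0.9 - (x - 0.7) ** 2


candidates = [i / 100 for i in range(101)]
xs = [0.0, 1.0]                      # two initial evaluations
ys = [toy_score(x) for x in xs]

for _ in range(8):
    offset = sum(ys) / len(ys)       # center scores for the zero-mean GP
    centered = [y - offset for y in ys]
    best = max(centered)
    untried = [c for c in candidates if c not in xs]
    # Update beliefs, then pick the candidate with the highest acquisition value.
    x_next = max(
        untried,
        key=lambda c: expected_improvement(*posterior(xs, centered, c), best),
    )
    xs.append(x_next)
    ys.append(toy_score(x_next))

best_x = xs[ys.index(max(ys))]
print("evaluated:", xs)
print("best setting:", best_x, "score:", max(ys))
```

Each pass refits the surrogate on all results so far before choosing the next point, which is the step that grid and random search do not have.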
Question 3 True / False
Random search is seldom better than grid search for hyperparameter optimization because grid search is exhaustive and therefore expected to find the optimal combination.
T. True
F. False
Answer: False
Grid search is exhaustive only within the discrete grid you define; it cannot be practically exhaustive for continuous hyperparameter spaces. More importantly, random search tends to outperform grid search when hyperparameters have unequal importance (which is typical). For the same evaluation budget, random search explores more distinct values of the important hyperparameters. Grid search spends evaluations on combinations that vary only unimportant hyperparameters while holding the important ones fixed at the same few grid values.
Question 4 True / False
Bayesian optimization uses an acquisition function to balance exploring uncertain regions of the hyperparameter space against exploiting regions already known to perform well.
T. True
F. False
Answer: True
The acquisition function (e.g., expected improvement, upper confidence bound) operationalizes the exploration-exploitation tradeoff. Regions of the hyperparameter space that the surrogate model is uncertain about (high variance) have high exploration value; regions near previously good configurations have high exploitation value. By weighing both, Bayesian optimization avoids both excessive exploitation (getting stuck at a local optimum) and excessive exploration (evaluating configurations that are unlikely to be good).
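As a small worked example of the tradeoff, the sketch below scores two candidates with an upper confidence bound of the form mean + kappa × std. The surrogate predictions (0.89 ± 0.01 and 0.85 ± 0.08) and the kappa values are made-up numbers chosen for illustration, and the names `ucb` and `pick` are hypothetical.

```python
def ucb(mean, std, kappa):
    """Upper confidence bound: predicted score plus an uncertainty bonus."""
    return mean + kappa * std


# Assumed surrogate predictions as (mean, std) pairs.
candidates = {
    "near_known_good": (0.89, 0.01),   # well explored, slightly better mean
    "uncertain_region": (0.85, 0.08),  # barely explored, high variance
}


def pick(kappa):
    """Return the candidate with the highest UCB for a given kappa."""
    return max(candidates, key=lambda name: ucb(*candidates[name], kappa))


print(pick(0.1))  # near_known_good: small kappa favors exploitation
print(pick(2.0))  # uncertain_region: large kappa favors exploration
```

With these numbers the choice flips at kappa = 0.04 / 0.07 ≈ 0.57, the point where the uncertainty bonus exactly offsets the difference in predicted means.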
Question 5 Short Answer
Why does Bayesian optimization typically require fewer training runs than random search to find a high-performing hyperparameter configuration, and when is this advantage most valuable?
Model answer: Bayesian optimization builds a surrogate model that learns the shape of the performance landscape — which regions of hyperparameter space tend to produce high validation scores — and uses this learned model to direct future evaluations. Rather than sampling blindly, it concentrates evaluations where the acquisition function predicts the most gain. This is most valuable when each training run is expensive (hours or days), such as large deep learning models. For cheap models where 1,000 random evaluations are feasible in minutes, the overhead of maintaining the surrogate may not justify the complexity.
The computational overhead of Bayesian optimization (fitting and querying the surrogate model) is negligible compared to training runs that take hours. For a model where each run takes 4 hours, 50 Bayesian evaluations (~8 days of sequential compute) may find better hyperparameters than 200 random evaluations (~33 days of sequential compute). For a model that trains in seconds, random search over thousands of configurations is simpler and nearly as effective.
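The day counts follow from simple arithmetic, sketched below for runs executed one after another. The 5 seconds of surrogate work per step is an assumed figure used only to show the order of magnitude; real overhead depends on the surrogate and the number of observations.

```python
HOURS_PER_RUN = 4
SURROGATE_SECONDS_PER_STEP = 5  # assumption for illustration only


def sequential_days(n_runs, hours_per_run=HOURS_PER_RUN):
    """Wall-clock days when runs execute one after another."""
    return n_runs * hours_per_run / 24


bayes_days = sequential_days(50)     # 200 hours
random_days = sequential_days(200)   # 800 hours

# Share of the Bayesian budget spent on the surrogate instead of training.
overhead_fraction = (50 * SURROGATE_SECONDS_PER_STEP) / (50 * HOURS_PER_RUN * 3600)

print(round(bayes_days, 1), round(random_days, 1))  # 8.3 33.3
print(f"surrogate overhead: {overhead_fraction:.4%}")  # 0.0347%
```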