For nonparametric regression over functions with smoothness s in d dimensions, the minimax rate is n^{-2s/(2s+d)}. What happens to this rate as the dimension d increases while smoothness s stays fixed?
AThe rate improves because higher-dimensional data contains more information
BThe rate worsens dramatically — the exponent 2s/(2s+d) approaches 0 as d grows, meaning the rate approaches n^0 = 1, and learning becomes essentially impossible without exponentially many samples
CThe rate is unaffected by dimension because smoothness is the only relevant property
DThe rate improves for d < 2s and worsens for d > 2s
As d increases, the exponent 2s/(2s+d) shrinks toward zero. For s=1, d=1: rate is n^{-2/3}. For s=1, d=10: rate is n^{-2/12} = n^{-1/6}. For s=1, d=100: rate is n^{-2/102} ≈ n^{-0.02}. To achieve 10% error at d=100 with s=1, you need n ≈ 0.1^{-50} = 10^{50} samples. This is the curse of dimensionality made precise: high-dimensional nonparametric estimation requires exponentially more data. The only escape is additional structure assumptions (sparsity, manifold, compositional) that reduce the effective dimension.
Question 2 True / False
A minimax optimal estimator achieves the best possible performance on every individual problem instance within the class.
TTrue
FFalse
Answer: False
Minimax optimality is about worst-case performance, not instance-by-instance performance. A minimax optimal estimator achieves the best possible worst-case error — it minimizes the maximum risk over all instances. But for any specific instance, there may exist a different estimator that performs better. For example, if the true function is very smooth, a minimax optimal estimator designed for a class including rough functions will be suboptimal on that particular instance. Adaptive estimators attempt to achieve near-minimax rates for each sub-class simultaneously, but they typically pay a logarithmic penalty compared to the oracle that knows the true complexity.
Question 3 True / False
The minimax rate for estimating a d-dimensional mean from n Gaussian observations is Theta(d/n). This means doubling the dimension requires doubling the sample size to maintain the same accuracy.
TTrue
FFalse
Answer: True
For d-dimensional Gaussian mean estimation, the minimax risk under squared error is exactly d*sigma^2/n (achieved by the sample mean). This linear relationship means that if you want error at most epsilon, you need n >= d*sigma^2/epsilon, which scales linearly with d. Doubling d from 100 to 200 requires doubling n. This is the parametric curse of dimensionality, which is relatively mild compared to the nonparametric curse (where the rate involves d in the exponent). The linear scaling reflects the fact that each dimension adds one independent quantity to estimate.
Question 4 Short Answer
Explain the distinction between parametric and nonparametric minimax rates, and why the nonparametric rate reveals the curse of dimensionality more severely.
Think about your answer, then reveal below.
Model answer: In parametric estimation (fixed number of parameters d), the minimax rate is typically Theta(d/n) — linear in dimension, inverse in sample size. This is 'mild' because the dependence on d is polynomial. In nonparametric estimation (the function class is infinite-dimensional, characterized by smoothness s), the minimax rate is Theta(n^{-2s/(2s+d)}). The dimension d appears in the exponent, making the rate exponentially worse in high dimensions. For d=100 and s=1, achieving epsilon accuracy requires n proportional to epsilon^{-51}, compared to epsilon^{-2} for parametric problems. The nonparametric rate reveals the true curse: without parametric structure (a finite-dimensional model), estimating functions in high dimensions from data is fundamentally intractable unless the functions have very high smoothness or the effective dimension is low (e.g., the function depends on a low-dimensional projection of the input).
This distinction motivates much of modern ML: neural networks are so effective partly because they implicitly exploit low-dimensional structure in high-dimensional data, achieving rates much better than the nonparametric worst case — but the theoretical understanding of which structural assumptions they exploit is incomplete.