Regularization theory provides the mathematical framework for solving ill-posed inverse problems: problems where a solution may fail to exist, fail to be unique, or fail to depend continuously on the data. In machine learning, learning from finite samples is ill-posed in the last sense: small changes in the training data can cause large changes in the learned function. Tikhonov regularization stabilizes the problem by adding a squared-norm penalty, which shrinks the solution toward zero. Spectral regularization generalizes this by applying a filter function to the eigenvalues of the kernel matrix, controlling which frequency components of the solution are retained. Both approaches can be understood through the bias-variance lens: the regularization parameter trades approximation error (bias) against estimation stability (variance).
Regularization in machine learning is often presented as a practical trick to prevent overfitting — add a penalty to the loss and tune its strength. Regularization theory reveals the deeper mathematical reason this works: learning from finite data is an ill-posed inverse problem, and regularization is the principled way to restore well-posedness.
An inverse problem is well-posed (in Hadamard's sense) if a solution exists, is unique, and depends continuously on the data. Learning from finite samples violates the third condition: the mapping from training data to learned function is unstable, and in the infinite-dimensional limit it is discontinuous. Small perturbations to the labels can cause the learned function to change dramatically, especially when the model is flexible. In the spectral view, the kernel matrix K has eigenvalues sigma_i that decay toward zero. The unregularized solution requires inverting K, which divides each eigencomponent of the data by sigma_i and so amplifies noise in the directions corresponding to small eigenvalues. These are exactly the high-frequency, fine-grained components where the signal-to-noise ratio is worst.
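The amplification can be checked numerically. The sketch below assumes a synthetic symmetric matrix standing in for K, with eigenvalues 2^-i chosen purely for illustration (not taken from any particular kernel), and perturbs the labels along the eigenvector with the smallest eigenvalue. The names make_matrix and amplification are hypothetical.

```python
import numpy as np

def make_matrix(sigma, seed=0):
    """Build a symmetric PSD matrix with the given eigenvalues (synthetic stand-in for K)."""
    rng = np.random.default_rng(seed)
    # Random orthonormal eigenvectors from a QR factorization.
    Q, _ = np.linalg.qr(rng.standard_normal((sigma.size, sigma.size)))
    return (Q * sigma) @ Q.T, Q

# Assumed spectrum: eigenvalues halve at each index, decaying toward zero.
sigma = 2.0 ** -np.arange(20)
K, Q = make_matrix(sigma)

rng = np.random.default_rng(1)
y = rng.standard_normal(sigma.size)

# Small label perturbation along the eigenvector with the smallest eigenvalue.
eps = 1e-3
delta = eps * Q[:, -1]

# Unregularized solutions for clean and perturbed labels.
alpha = np.linalg.solve(K, y)
alpha_perturbed = np.linalg.solve(K, y + delta)

# Ratio of output change to input change; for this direction it is 1 / sigma_min.
amplification = np.linalg.norm(alpha_perturbed - alpha) / np.linalg.norm(delta)
print(f"smallest eigenvalue: {sigma[-1]:.3e}")
print(f"amplification of the perturbation: {amplification:.3e}")
```

In this setup a perturbation that is tiny relative to the labels changes the solution by a factor of roughly 1 / sigma_min, which is the instability described above.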
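Tikhonov regularization adds lambda * ||f||^2 to the loss, changing the effective inversion from K^{-1} to (K + lambda * I)^{-1}. In the eigendecomposition of K, with eigenvalues sigma_i, each eigencomponent is divided by sigma_i + lambda instead of by sigma_i. Equivalently, the unregularized component is multiplied by the filter factor sigma_i / (sigma_i + lambda), which is also the eigenvalue of the smoothing operator (K + lambda * I)^{-1} * K that maps labels to fitted values. When sigma_i is much larger than lambda (strong signal directions), the filter is close to 1 and the information is preserved. When sigma_i is much smaller than lambda (noisy directions), the filter is approximately sigma_i / lambda and suppresses the component toward zero. The regularization parameter lambda sets a soft threshold: eigencomponents well above lambda pass through nearly unchanged, those well below lambda are attenuated, and a component with sigma_i = lambda is exactly halved. This is a smooth, principled tradeoff between retaining signal and suppressing noise.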
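A minimal sketch of the filter-factor view, assuming a hypothetical toy problem (a Gaussian kernel matrix on 15 evenly spaced points with sine labels). It checks that dividing each eigencomponent by sigma_i + lambda gives the same coefficients as solving the linear system with K + lambda * I directly. The function names are illustrative, not from any library.

```python
import numpy as np

def tikhonov_filter(sigma, lam):
    """Filter factor sigma / (sigma + lam) applied to each eigencomponent."""
    return sigma / (sigma + lam)

def tikhonov_solve_spectral(K, y, lam):
    """Regularized coefficients via the eigendecomposition of a symmetric PSD K."""
    sigma, U = np.linalg.eigh(K)
    # Dividing by (sigma + lam) equals applying the filter to the naive 1 / sigma inverse.
    return U @ ((U.T @ y) / (sigma + lam))

# Hypothetical toy data: Gaussian kernel matrix on 15 points in [0, 1].
x = np.linspace(0.0, 1.0, 15)
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2)
y = np.sin(2 * np.pi * x)
lam = 1e-2

alpha_spectral = tikhonov_solve_spectral(K, y, lam)
alpha_direct = np.linalg.solve(K + lam * np.eye(x.size), y)

# The two routes should agree up to floating-point error.
print(np.allclose(alpha_spectral, alpha_direct))

# Filter factors for a large eigenvalue, one equal to lam, and a tiny one.
print(tikhonov_filter(np.array([1.0, lam, 1e-6]), lam))
```

The second print shows the three regimes: near 1 for a large eigenvalue, one half at sigma = lambda, and near 0 for a small eigenvalue.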
Spectral regularization generalizes this idea. Any method that applies a filter function g_lambda(sigma) to the eigenvalues of the kernel matrix is a spectral regularizer. Tikhonov uses g_lambda(sigma) = sigma / (sigma + lambda). Truncated SVD uses a hard cutoff: g(sigma) = 1 for sigma above a threshold, 0 below. Early stopping in iterative methods like gradient descent is also a spectral regularizer: after t iterations with step size eta (small enough that eta * sigma < 1 for every eigenvalue), the implicit filter is g(sigma) = 1 - (1 - eta * sigma)^t, which gradually incorporates more eigencomponents as training proceeds, so that 1 / (eta * t) plays a role analogous to lambda. This unifying eigenvalue perspective reveals that many seemingly different regularization strategies (norm penalties, truncation, early stopping) are all performing the same fundamental operation: controlling which spectral components of the solution are retained, trading bias for stability in a way that depends on the eigenstructure of the problem.
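The three filters can be written side by side, and the early-stopping claim can be checked against an explicit iteration. This is a sketch under stated assumptions: the toy data is hypothetical, and gradient descent is taken to be the iteration alpha <- alpha + eta * (y - K alpha) started from zero, which is one common way to write it in the kernel setting. The helper filtered_fit applies a filter to the labels in the eigenbasis of K to obtain fitted values.

```python
import numpy as np

def tikhonov_filter(sigma, lam):
    """Smooth shrinkage: sigma / (sigma + lam)."""
    return sigma / (sigma + lam)

def tsvd_filter(sigma, threshold):
    """Hard cutoff: keep components with sigma above the threshold, drop the rest."""
    return (sigma > threshold).astype(float)

def gradient_descent_filter(sigma, eta, t):
    """Implicit filter after t gradient steps with step size eta."""
    return 1.0 - (1.0 - eta * sigma) ** t

def filtered_fit(K, y, g):
    """Fitted values U diag(g(sigma)) U^T y for a symmetric PSD matrix K."""
    sigma, U = np.linalg.eigh(K)
    return U @ (g(sigma) * (U.T @ y))

# Hypothetical toy data: Gaussian kernel matrix on 15 points in [0, 1].
x = np.linspace(0.0, 1.0, 15)
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2)
y = np.sin(2 * np.pi * x) + 0.1 * np.cos(15 * x)

# Step size chosen so that eta * sigma stays below 1 for every eigenvalue.
eta = 0.9 / np.linalg.eigvalsh(K).max()
t = 200

# Explicit iteration (assumed form of gradient descent), starting from zero.
alpha = np.zeros_like(y)
for _ in range(t):
    alpha += eta * (y - K @ alpha)

fit_iterated = K @ alpha
fit_filtered = filtered_fit(K, y, lambda s: gradient_descent_filter(s, eta, t))

# Early-stopped iteration and the closed-form filter should agree.
print(np.allclose(fit_iterated, fit_filtered))

# The same helper covers the other two regularizers by swapping the filter.
fit_tikhonov = filtered_fit(K, y, lambda s: tikhonov_filter(s, 1e-2))
fit_tsvd = filtered_fit(K, y, lambda s: tsvd_filter(s, 1e-2))
```

Only the function passed to filtered_fit changes between the three methods, which is the sense in which they are the same operation with different filters.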