Representer Theorem

Research Depth 75 in the knowledge graph I know this Set as goal
Unlocks 14 downstream topics
kernel-theory optimization regularization

Core Idea

The representer theorem states that the solution to any regularized empirical risk minimization problem over an RKHS — minimizing a loss plus a monotonically increasing function of the RKHS norm — lies in the span of the kernel functions evaluated at the training points: f*(x) = sum_{i=1}^{n} alpha_i * k(x_i, x). Even though the RKHS may be infinite-dimensional, the optimal function is determined by just n coefficients. This reduces an infinite-dimensional optimization problem to a finite-dimensional one, making kernel methods computationally tractable and explaining why the kernel matrix (Gram matrix) is the central object in kernel algorithms.

Explainer

The RKHS framework provides a rich, infinite-dimensional space of functions — but how do you actually optimize over such a space? The representer theorem answers this by showing that regularized optimization in an RKHS automatically produces solutions that live in a finite-dimensional subspace, making the infinite-dimensional problem tractable.

The setup is a regularized empirical risk minimization problem: minimize (1/n) * sum_{i=1}^{n} L(y_i, f(x_i)) + lambda * g(||f||), where L is a loss function, g is a monotonically increasing function (typically g(t) = t^2), and f ranges over the RKHS H_k. The theorem states that the minimizer has the form f*(x) = sum_{i=1}^{n} alpha_i * k(x_i, x). The proof is elegant: decompose any f in the RKHS as f = f_span + f_perp, where f_span lies in the span of {k(x_1, ·), ..., k(x_n, ·)} and f_perp is orthogonal to this subspace. By the reproducing property, f(x_i) = <f, k(x_i, ·)> = <f_span, k(x_i, ·)> — the orthogonal component does not affect any training-point evaluation. So f_perp contributes nothing to the loss but increases the RKHS norm (||f||^2 = ||f_span||^2 + ||f_perp||^2). The regularizer penalizes the larger norm, so the optimal f_perp is zero.

This result transforms kernel learning into linear algebra. Substituting the representer form into the optimization problem, the objective becomes a function of the n-dimensional vector alpha, with all RKHS geometry encoded in the n-by-n kernel matrix K. For kernel ridge regression, the closed-form solution is alpha = (K + lambda * I)^{-1} * y. For SVMs, the representer theorem justifies the dual formulation that depends only on kernel evaluations between training points. The computational cost scales with the number of training points (typically O(n^3) or O(n^2) depending on the algorithm), not with the dimensionality of the RKHS.

The representer theorem also clarifies the role of regularization in kernel methods. Beyond preventing overfitting, regularization is structurally necessary: it is what makes the optimization finite-dimensional. Without the norm penalty, the problem is ill-posed in the infinite-dimensional RKHS — there may be infinitely many functions achieving the same empirical risk, with no reason to prefer one over another. The regularizer selects the minimum-norm solution, which the representer theorem guarantees lies in the finite-dimensional span of the training kernel functions. This deep connection between regularization and tractability is one of the most elegant results in machine learning theory.

Practice Questions 4 questions

Prerequisite Chain

Counting to 10Counting to 20Understanding ZeroThe Number ZeroCounting to FiveOne-to-One CorrespondenceCombining Small Groups Within 5Addition Within 10Addition Within 20Two-Digit Addition Without RegroupingTwo-Digit Addition with RegroupingAddition Within 100Repeated Addition as MultiplicationMultiplication Facts Within 100Division as Equal SharingDivision as Grouping (Measurement Division)Division: Grouping (Repeated Subtraction) ModelDivision: Fair Sharing ModelDivision as Equal SharingDivision as GroupingBasic Division FactsDivision Facts Within 100Two-Digit by One-Digit DivisionDivision with RemaindersRemainders and Quotients in DivisionDivision Word ProblemsIntroduction to Long DivisionFactors and MultiplesPrime and Composite NumbersEquivalent FractionsRelating Fractions and DecimalsDecimal Place ValueReading and Writing DecimalsComparing and Ordering DecimalsAdding and Subtracting DecimalsMultiplying DecimalsDividing DecimalsDividing FractionsMixed Number ArithmeticOrder of OperationsInteger Order of OperationsVariable ExpressionsCombining Like TermsOne-Step EquationsTwo-Step EquationsSolving Multi-Step EquationsEquations with Variables on Both SidesLiteral EquationsSlope-Intercept FormPoint-Slope FormWriting Linear EquationsParallel and Perpendicular Line SlopesGraphing Linear EquationsPiecewise FunctionsOne-Sided LimitsContinuity DefinitionLimit Definition of the DerivativePower RuleConstant Multiple and Sum/Difference RulesProduct RuleChain RuleHigher-Order DerivativesConcavity and Inflection PointsSecond Derivative TestCurve SketchingOptimization ProblemsCritical Points of Multivariable FunctionsCritical Points and Classification of ExtremaSecond Partial Test for Local Extrema (Hessian)The Hessian Matrix and Second Derivative TestUnconstrained Optimization: Finding ExtremaOptimization in Multiple VariablesSupport Vector MachinesKernel Methods and the Kernel TrickKernel Theory and RKHSRepresenter Theorem

Longest path: 76 steps · 504 total prerequisite topics

Prerequisites (3)

Leads To (1)