A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Representer Theorem

Research · Depth 101 in the knowledge graph

14 topics build on this

709 prerequisites beneath it

Core Idea

The representer theorem states that a regularized empirical risk minimization problem over an RKHS (a loss on the training points plus a strictly increasing function of the RKHS norm) has its minimizers, whenever they exist, in the span of the kernel functions evaluated at the training points: f*(x) = sum_{i=1}^{n} alpha_i * k(x_i, x). Even though the RKHS may be infinite-dimensional, the optimal function is determined by just n coefficients. This reduces an infinite-dimensional optimization problem to a finite-dimensional one, which makes kernel methods computationally tractable and explains why the kernel matrix (Gram matrix) is the central object in kernel algorithms.

Explainer

The RKHS framework provides a rich, infinite-dimensional space of functions — but how do you actually optimize over such a space? The representer theorem answers this by showing that regularized optimization in an RKHS automatically produces solutions that live in a finite-dimensional subspace, making the infinite-dimensional problem tractable.

The setup is a regularized empirical risk minimization problem: minimize (1/n) * sum_{i=1}^{n} L(y_i, f(x_i)) + lambda * g(||f||), where L is a loss function, lambda > 0, g is a strictly increasing function (typically g(t) = t²), and f ranges over the RKHS H_k. The theorem states that any minimizer has the form f*(x) = sum_{i=1}^{n} alpha_i * k(x_i, x). The proof is short: decompose any f in the RKHS as f = f_span + f_perp, where f_span lies in the span of {k(x_1, ·), ..., k(x_n, ·)} and f_perp is orthogonal to this subspace (the decomposition exists because a finite-dimensional subspace is closed). By the reproducing property, f(x_i) = <f, k(x_i, ·)> = <f_span, k(x_i, ·)>, so the orthogonal component does not affect any training-point evaluation. Hence f_perp contributes nothing to the loss, but by the Pythagorean identity ||f||^2 = ||f_span||^2 + ||f_perp||^2 any nonzero f_perp strictly increases the RKHS norm. Because g is strictly increasing, dropping f_perp strictly lowers the objective, so every minimizer has f_perp = 0. If g is only nondecreasing, the same argument still shows that at least one minimizer (when one exists) lies in the span.
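The decomposition in the proof can be checked numerically in a small finite-dimensional stand-in for the RKHS. The sketch below assumes an explicit feature map, so that a function is a weight vector w with f(x) = <w, phi(x)> and k(x_i, x_j) = <phi(x_i), phi(x_j)>. The names Phi, w_span and w_perp, the random data, and the dimensions are illustrative choices, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 6                        # 3 training points, 6-dimensional feature space
Phi = rng.normal(size=(n, d))      # row i stands in for k(x_i, .), i.e. phi(x_i)
w = rng.normal(size=d)             # an arbitrary function f(x) = <w, phi(x)>

K = Phi @ Phi.T                    # Gram matrix K_ij = <phi(x_i), phi(x_j)>

# Orthogonal projection of w onto span{phi(x_1), ..., phi(x_n)}:
# w_span = Phi^T alpha with alpha solving K alpha = Phi w.
alpha = np.linalg.solve(K, Phi @ w)
w_span = Phi.T @ alpha
w_perp = w - w_span

train_vals_full = Phi @ w          # f(x_i) for the full function
train_vals_span = Phi @ w_span     # f_span(x_i)
norm_sq_full = w @ w
norm_sq_split = w_span @ w_span + w_perp @ w_perp

print(np.allclose(train_vals_full, train_vals_span))  # same training evaluations
print(np.isclose(norm_sq_full, norm_sq_split))        # Pythagorean identity
```

Both functions produce identical values at the training points, while the full function carries the extra norm of w_perp. Any strictly increasing penalty on the norm therefore prefers w_span.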

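This result transforms kernel learning into linear algebra. Substituting the representer form into the optimization problem, the objective becomes a function of the n-dimensional vector alpha, with all RKHS geometry encoded in the n-by-n kernel matrix K, where K_ij = k(x_i, x_j): the training predictions are K * alpha and the squared norm is ||f||^2 = alpha^T * K * alpha. For kernel ridge regression (squared loss with g(t) = t²), the closed-form solution under the 1/n normalization above is alpha = (K + n * lambda * I)^-1 * y; the commonly quoted form alpha = (K + lambda * I)^-1 * y corresponds to an unnormalized sum of squared errors. For SVMs, the representer theorem justifies the dual formulation that depends only on kernel evaluations between training points. The computational cost scales with the number of training points, not with the dimensionality of the RKHS: a direct solve of the n-by-n system takes O(n³) time, and storing K takes O(n²) memory.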
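As a concrete illustration of the reduction to linear algebra, here is a minimal kernel ridge regression sketch in NumPy. It assumes a Gaussian (RBF) kernel and the 1/n-normalized squared-loss objective, for which the linear system is (K + n * lambda * I) alpha = y. The helper names rbf_kernel, fit_krr and predict_krr, the parameter gamma, and the synthetic data are hypothetical and chosen only for this example.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negatives from round-off

def fit_krr(X, y, lam, gamma=1.0):
    """Solve (K + n*lam*I) alpha = y, the minimizer of
    (1/n) * ||y - K alpha||^2 + lam * alpha^T K alpha."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return alpha, K

def predict_krr(X_train, alpha, X_new, gamma=1.0):
    """Evaluate f*(x) = sum_i alpha_i * k(x_i, x) at each row of X_new."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Synthetic 1-D regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

lam = 1e-3
alpha, K = fit_krr(X, y, lam)
X_test = np.linspace(-3.0, 3.0, 5)[:, None]
y_pred = predict_krr(X, alpha, X_test)

# Gradient of the objective with respect to alpha; it should vanish at the solution.
n = X.shape[0]
grad = (2.0 / n) * K @ (K @ alpha - y) + 2.0 * lam * K @ alpha
print(np.linalg.norm(grad))
```

The fitted model is stored entirely in the n coefficients alpha, and prediction needs only kernel evaluations between new points and the training points.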

The representer theorem also clarifies the role of regularization in kernel methods. Beyond preventing overfitting, the norm penalty is what pins the solution to the finite-dimensional span. Without it, the problem is ill-posed in an infinite-dimensional RKHS: adding any f_perp orthogonal to the training kernel functions leaves the empirical risk unchanged, so there may be infinitely many minimizers with no reason to prefer one over another. A minimizer in the span still exists whenever any minimizer does (project onto the span), but nothing forces the solution there. A strictly increasing penalty removes this ambiguity by ruling out every candidate with a nonzero orthogonal component, and for squared loss the regularized solution approaches the minimum-norm least-squares solution as lambda tends to 0. This link between regularization and tractability is a large part of why kernel methods are practical.
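The non-uniqueness without a penalty, and the way a small penalty resolves it, can be seen in a toy example. The sketch below again assumes an explicit finite-dimensional feature map with more features than training points, squared loss, and the 1/n normalization. The names w_min, w_other and ridge, along with the random data, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 8                         # fewer training points than feature dimensions
Phi = rng.normal(size=(n, d))       # row i stands in for k(x_i, .)
y = rng.normal(size=n)
K = Phi @ Phi.T

# Interpolant that lies in the span of the rows of Phi (minimum-norm interpolant).
w_min = Phi.T @ np.linalg.solve(K, y)

# A second interpolant: add a direction orthogonal to every phi(x_i).
v = rng.normal(size=d)
v_perp = v - Phi.T @ np.linalg.solve(K, Phi @ v)
w_other = w_min + v_perp

def ridge(lam):
    """Minimizer of (1/n)*||y - Phi w||^2 + lam*||w||^2, written in representer form."""
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return Phi.T @ alpha

w_small = ridge(1e-9)

print(np.allclose(Phi @ w_min, y), np.allclose(Phi @ w_other, y))  # both fit the data
print(np.linalg.norm(w_min) < np.linalg.norm(w_other))             # span solution is smaller
print(np.allclose(w_small, w_min, atol=1e-4))                      # small-lambda limit
```

Both weight vectors achieve zero training error, so the unpenalized problem cannot distinguish them. The penalized solution always lies in the span, and in this example it approaches w_min as lambda shrinks.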

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Linear Regression in Machine Learning → Neural Network Fundamentals → Backpropagation Algorithm → Multilayer Perceptrons (MLPs) → Activation Functions in Neural Networks → Vanishing Gradient Problem → Gradient Descent and Optimization → Gradient Boosting Machines → Support Vector Machines → Kernel Methods and the Kernel Trick → Kernel Theory and RKHS → Representer Theorem

Longest path: 102 steps · 709 total prerequisite topics

Prerequisites (3)

Kernel Theory and RKHS (hard), Regularization Techniques (soft), Matrix Multiplication (soft)

Leads To (1)

Regularization Theory (Tikhonov, Spectral) (soft)