The representer theorem states that the solution to any regularized empirical risk minimization problem over an RKHS — minimizing a loss plus a monotonically increasing function of the RKHS norm — lies in the span of the kernel functions evaluated at the training points: f*(x) = sum_{i=1}^{n} alpha_i * k(x_i, x). Even though the RKHS may be infinite-dimensional, the optimal function is determined by just n coefficients. This reduces an infinite-dimensional optimization problem to a finite-dimensional one, making kernel methods computationally tractable and explaining why the kernel matrix (Gram matrix) is the central object in kernel algorithms.
The RKHS framework provides a rich, infinite-dimensional space of functions — but how do you actually optimize over such a space? The representer theorem answers this by showing that regularized optimization in an RKHS automatically produces solutions that live in a finite-dimensional subspace, making the infinite-dimensional problem tractable.
The setup is a regularized empirical risk minimization problem: minimize (1/n) * sum_{i=1}^{n} L(y_i, f(x_i)) + lambda * g(||f||), where L is a loss function, g is a monotonically increasing function (typically g(t) = t^2), and f ranges over the RKHS H_k. The theorem states that the minimizer has the form f*(x) = sum_{i=1}^{n} alpha_i * k(x_i, x). The proof is elegant: decompose any f in the RKHS as f = f_span + f_perp, where f_span lies in the span of {k(x_1, ·), ..., k(x_n, ·)} and f_perp is orthogonal to this subspace. By the reproducing property, f(x_i) = <f, k(x_i, ·)> = <f_span, k(x_i, ·)> — the orthogonal component does not affect any training-point evaluation. So f_perp contributes nothing to the loss but increases the RKHS norm (||f||^2 = ||f_span||^2 + ||f_perp||^2). The regularizer penalizes the larger norm, so the optimal f_perp is zero.
This result transforms kernel learning into linear algebra. Substituting the representer form into the optimization problem, the objective becomes a function of the n-dimensional vector alpha, with all RKHS geometry encoded in the n-by-n kernel matrix K. For kernel ridge regression, the closed-form solution is alpha = (K + lambda * I)^{-1} * y. For SVMs, the representer theorem justifies the dual formulation that depends only on kernel evaluations between training points. The computational cost scales with the number of training points (typically O(n^3) or O(n^2) depending on the algorithm), not with the dimensionality of the RKHS.
The representer theorem also clarifies the role of regularization in kernel methods. Beyond preventing overfitting, regularization is structurally necessary: it is what makes the optimization finite-dimensional. Without the norm penalty, the problem is ill-posed in the infinite-dimensional RKHS — there may be infinitely many functions achieving the same empirical risk, with no reason to prefer one over another. The regularizer selects the minimum-norm solution, which the representer theorem guarantees lies in the finite-dimensional span of the training kernel functions. This deep connection between regularization and tractability is one of the most elegant results in machine learning theory.