The universal approximation theorem (Cybenko, 1989; Hornik, 1991) proves that a feedforward neural network with a single hidden layer and a non-polynomial activation function can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units. This establishes that neural networks are universal function approximators — their approximation error can be driven to zero. However, the theorem says nothing about how many hidden units are needed (the width may need to be exponentially large) or whether gradient descent can find the approximating weights. The gap between approximation capacity and practical learnability is the central tension in neural network theory.
The question of what neural networks can represent — independent of how they are trained — is the domain of approximation theory. The universal approximation theorem is the foundational result: it proves that neural networks are, in principle, capable of representing any continuous function to any desired accuracy. This might sound like it settles the question of neural network power, but the theorem's limitations are as important as its guarantees.
The theorem states: for any continuous function f on a compact domain K in R^d, any epsilon > 0, and any non-polynomial continuous activation function sigma, there exists a single-hidden-layer network g(x) = sum_{i=1}^{N} alpha_i * sigma(w_i^T * x + b_i) such that |f(x) - g(x)| < epsilon for all x in K. The proof, in its original form by Cybenko (for sigmoidal activations) and generalized by Hornik, uses functional analysis — specifically, the fact that the span of translated and scaled activation functions is dense in the space of continuous functions. The key requirement is that sigma is non-polynomial: polynomial activations compute polynomials of bounded degree and cannot approximate arbitrary functions.
The theorem's critical limitation is that it says nothing about the width N required. For a simple low-frequency function, a few hidden neurons might suffice. For a highly oscillatory function or a function with sharp transitions, N might need to be astronomically large. Depth-separation results demonstrate this concretely: Telgarsky (2016) showed functions computable by deep networks of polynomial size that require exponential width to approximate with shallow networks. This means depth is not merely a training convenience — it provides genuine representational efficiency for certain function classes. The functions that benefit from depth tend to involve hierarchical or compositional structure, which matches the intuition that deep networks learn hierarchical features.
The gap between approximation and learning is the central open question in neural network theory. Approximation theory tells us that good weights exist; optimization theory asks whether gradient descent can find them (the loss landscape is non-convex and potentially riddled with local minima); generalization theory asks whether the network trained on finite data performs well on unseen data (the network may overfit, especially when over-parameterized). Modern deep learning theory works to bridge these gaps: over-parameterization results show that wide networks have benign loss landscapes where gradient descent finds global minima; implicit regularization results show that gradient descent preferentially finds solutions that generalize well; and neural tangent kernel theory connects the training dynamics of wide networks to kernel methods with well-understood generalization properties. But a complete theory that explains all three aspects simultaneously remains elusive.