Sample complexity bounds answer the most practical question in learning theory: how many training examples are necessary and sufficient to learn a concept class to a desired accuracy? Upper bounds show that m(epsilon, delta) samples suffice for any algorithm (typically ERM); lower bounds show that no algorithm can learn with fewer. For realizable PAC learning with VC dimension d, the sample complexity is Theta((d/epsilon) * log(1/epsilon) + (1/epsilon) * log(1/delta)). For agnostic learning, it is Theta(d/epsilon^2 + log(1/delta)/epsilon^2). These bounds bridge theory and practice by converting abstract complexity measures (VC dimension) into concrete data requirements.
Sample complexity bounds are where learning theory meets practice most directly. They answer the question every practitioner implicitly asks: "how much data do I need?" While the bounds are worst-case and often conservative, they provide the correct scaling relationships — how data requirements grow with model complexity, desired accuracy, and confidence level.
The realizable case (where the target function is in the hypothesis class) gives the cleanest bounds. The upper bound, proved via uniform convergence, states that m = O((d/epsilon) * log(1/epsilon) + (1/epsilon) * log(1/delta)) samples suffice, where d is the VC dimension. The matching lower bound, proved via information-theoretic arguments, shows that Omega(d/epsilon + log(1/delta)/epsilon) samples are necessary. These match up to the log(1/epsilon) factor, which is known to be tight. The key insight is the linear dependence on d — each additional unit of VC dimension adds a proportional amount to the data requirement.
The agnostic case (where the best hypothesis in the class may have nonzero error) has tighter matching bounds: Theta(d/epsilon^2 + log(1/delta)/epsilon^2). The extra factor of 1/epsilon compared to the realizable case reflects the statistical cost of estimating error rates rather than detecting the absence of errors. This quadratic dependence on 1/epsilon has significant practical implications: achieving 1% error requires 100 times more data than achieving 10% error, and achieving 0.1% error requires 10,000 times more than 10%. The implication is clear — pushing accuracy to very low levels requires enormous datasets or strong inductive biases.
Beyond the basic PAC bounds, sample complexity theory has been refined for specific hypothesis classes and learning settings. For linear classifiers in d dimensions (VC dimension d+1), the bound is Theta(d/epsilon^2). For kernel methods with margin gamma, the effective dimension is R^2/gamma^2 (where R is the data radius), which can be much smaller than the ambient dimension. For deep networks, the sample complexity is less well-characterized — it depends on the complexity measure used (spectral norms, PAC-Bayes, compression) and the resulting bounds are often loose. The general principle remains: sample complexity is proportional to the effective complexity of the hypothesis class (however measured) and inversely proportional to the square of the desired accuracy. This principle guides decisions about model selection, data collection, and the feasibility of learning tasks.