Concentration inequalities quantify how tightly a random variable clusters around its expected value. Unlike the central limit theorem (which gives asymptotic normality), concentration inequalities provide finite-sample, non-asymptotic tail bounds. Hoeffding's inequality bounds the deviation of a sum of bounded independent random variables: P(|mean - E[mean]| >= t) <= 2 * exp(-2nt^2) for variables in [0,1]. McDiarmid's inequality extends this to any function of independent variables with bounded differences. Bernstein's inequality tightens the bound when the variance is small. These are the workhorses of learning theory — every generalization bound, uniform convergence result, and sample complexity argument relies on concentration inequalities.
Concentration inequalities are the probabilistic engine that powers virtually every result in machine learning theory. They answer a specific question: given n independent samples, how likely is it that the sample statistic deviates from its expected value by more than a given amount? The answers they provide are non-asymptotic (valid for any n, not just large n) and distribution-free (valid for any distribution satisfying the stated conditions), making them the ideal tools for deriving PAC-style guarantees.
Hoeffding's inequality is the workhorse. If X_1, ..., X_n are independent random variables with X_i in [a_i, b_i], then P(|mean(X) - E[mean(X)]| >= t) <= 2 * exp(-2n^2t^2 / sum(b_i - a_i)^2). For variables in [0,1], this simplifies to P(|mean - mu| >= t) <= 2 * exp(-2nt^2). The bound is exponential in n — the probability of a large deviation decreases exponentially with the sample size. This exponential decay is why PAC bounds have logarithmic dependence on 1/delta: to achieve failure probability delta, you need n proportional to log(1/delta), because exp(-cn) = delta gives n = log(1/delta)/c.
McDiarmid's inequality extends Hoeffding to functions beyond simple means. If f(X_1, ..., X_n) satisfies the bounded differences property — changing any single X_i changes f by at most c_i — then P(|f - E[f]| >= t) <= 2 * exp(-2t^2 / sum c_i^2). This is directly applicable to learning theory: the empirical risk of a hypothesis h on a sample (X_1, y_1), ..., (X_n, y_n) is a function with bounded differences (changing one example changes the risk by at most 1/n), so McDiarmid gives the concentration of empirical risk around true risk. This is the foundation of generalization bounds.
Bernstein's inequality refines Hoeffding when the variance is known to be small: P(mean - mu >= t) <= exp(-nt^2 / (2(sigma^2 + t/3))). When sigma^2 << 1/4 (the maximum variance for [0,1] variables), Bernstein's bound is substantially tighter. This matters in learning theory when the best hypothesis has low error — the variance of its indicator variable is approximately p(1-p), which is small when p (the error) is small. Bernstein-based bounds give faster rates (O(1/n) instead of O(1/sqrt(n))) in the low-error regime, capturing the intuition that learning is easier when the target is nearly deterministic. Together, these three inequalities form the probabilistic toolkit that every subsequent topic in this course draws upon.