Boosting theory proves that any "weak learner" — an algorithm that performs only slightly better than random guessing — can be transformed into an arbitrarily accurate "strong learner" by combining many weak hypotheses through weighted majority voting. AdaBoost achieves this by iteratively reweighting training examples to focus on those the current ensemble gets wrong, then combining weak hypotheses with weights proportional to their accuracy. The training error decreases exponentially with the number of rounds. The generalization theory, based on margin analysis rather than VC dimension of the combined classifier, explains why boosting often does not overfit even with many rounds — the margins on training examples continue to increase.
Boosting theory addresses a foundational question: if you can only build a classifier that is slightly better than random guessing, can you somehow combine many such weak classifiers into one that is arbitrarily accurate? The answer, proved by Robert Schapire in 1990, is yes — and this equivalence between weak and strong learning is one of the deepest results in computational learning theory.
AdaBoost (Adaptive Boosting) is the practical algorithm that realizes this theoretical promise. It works in rounds. In each round t, it trains a weak learner on the training data with a specific weighting of examples. Examples that the current ensemble misclassifies receive higher weight, forcing the next weak learner to focus on the hard cases. The weak hypothesis h_t is then added to the ensemble with a weight alpha_t = (1/2) * ln((1 - epsilon_t) / epsilon_t), where epsilon_t is the weighted error of h_t. More accurate weak learners get higher weight in the final vote. The combined classifier is H(x) = sign(sum_t alpha_t * h_t(x)).
The training error analysis is clean and powerful. If each weak learner achieves error at most 1/2 - gamma on its weighted distribution, the training error of the combined classifier after T rounds is at most exp(-2 * gamma^2 * T). This exponential decay means that even a tiny edge gamma over random guessing drives the training error to zero exponentially fast. The edge gamma can be extremely small — a 51% accurate weak learner suffices — and the number of rounds T needed is proportional to 1/gamma^2. This is the "boosting" phenomenon: amplification of weak advantage into strong performance.
The generalization theory is where boosting becomes truly interesting. A naive VC dimension analysis would predict overfitting: the combined classifier has VC dimension proportional to T times the weak learner's VC dimension, so the generalization bound worsens as T grows. But empirically, boosting often does not overfit even after hundreds or thousands of rounds. The explanation comes from margin theory, developed by Schapire, Freund, Bartlett, and Lee. The margin of a training example is the confidence of the correct prediction: the weighted vote for the correct label minus the weighted vote for the incorrect label. Margin-based generalization bounds show that test error depends on the distribution of margins, not on T. As boosting continues past zero training error, it continues to increase margins — making predictions more confident — which improves the generalization bound. This insight resolved the "mystery" of boosting's resistance to overfitting and established margin theory as a central tool in learning theory.
No topics depend on this one yet.