Support Vector Machines

Graduate · Depth 72 in the knowledge graph
Unlocks 18 downstream topics
supervised-learning classification margin-based

Core Idea

SVMs find the separating hyperplane that maximizes the margin between classes. Soft-margin SVMs tolerate margin violations and misclassified points via slack variables. Kernels implicitly map inputs to a high-dimensional feature space, enabling non-linear classification without ever computing that mapping explicitly.

Explainer

The intuition behind SVMs starts with a simple question: given two linearly separable classes, which separating hyperplane should you choose? Many hyperplanes separate the training data correctly, but some pass so close to the training points that they will misclassify new points that differ only slightly from those seen during training. SVMs resolve this ambiguity by choosing the hyperplane that maximizes the *margin*: the distance between the boundary and the nearest training points on each side. A wider margin makes the classifier more robust, because a test point can lie further from the training points of its class before being misclassified.
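
To make the margin concrete, here is a minimal sketch that scores two candidate hyperplanes on a toy dataset. The data points, the function name `geometric_margin`, and both candidate hyperplanes are assumptions chosen for illustration, and the code only compares candidates rather than solving for the best one.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    The result is positive only if every point lies on its correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical toy data: two points per class in 2-D.
points = [(2.0, 2.0), (3.0, 1.0), (-2.0, -2.0), (-1.0, -3.0)]
labels = [1, 1, -1, -1]

# Two candidate hyperplanes through the origin; both separate the data.
margin_diag = geometric_margin((1.0, 1.0), 0.0, points, labels)  # boundary x1 + x2 = 0
margin_vert = geometric_margin((1.0, 0.0), 0.0, points, labels)  # boundary x1 = 0

print(margin_diag, margin_vert)
```

Both boundaries classify every training point correctly, but the diagonal one keeps all points further away, so it is the one a margin-maximizing classifier would prefer of the two.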

The training points that sit exactly on the margin boundaries are called *support vectors*, and in the separable (hard-margin) case they are the only points that determine the hyperplane. Once margin violations are allowed, the violating points count as support vectors as well. Every other correctly classified point is irrelevant: you could remove it from the dataset and get the same model. This sparsity is both elegant and practical: at prediction time, you only need to store the support vectors and evaluate inner products (or kernel values) against them, not against the full training set.
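
The sparsity can be seen in the dual form of the decision function, f(x) = Σᵢ αᵢ yᵢ ⟨xᵢ, x⟩ + b, where only support vectors have αᵢ > 0. The sketch below assumes a hypothetical four-point dataset whose hard-margin solution can be worked out by hand (w = (1, 0), b = 0); the helper names `dot` and `decision` are illustrative and do not come from any library.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def decision(x, points, labels, alphas, b, kernel=dot):
    """Dual-form SVM score: sum_i alpha_i * y_i * k(x_i, x) + b."""
    return sum(a * y * kernel(xi, x)
               for xi, y, a in zip(points, labels, alphas)) + b

# Hypothetical toy data. Only the first two points lie on the margin,
# so only they receive non-zero dual coefficients (worked out by hand).
points = [(1.0, 0.0), (-1.0, 0.0), (3.0, 1.0), (-2.0, -2.0)]
labels = [1, -1, 1, -1]
alphas = [0.5, 0.5, 0.0, 0.0]
b = 0.0

# Keep only the support vectors (alpha > 0).
sv = [(x, y, a) for x, y, a in zip(points, labels, alphas) if a > 0]
sv_points, sv_labels, sv_alphas = (list(t) for t in zip(*sv))

query = (0.5, 2.0)
full_score = decision(query, points, labels, alphas, b)
sparse_score = decision(query, sv_points, sv_labels, sv_alphas, b)

print(full_score, sparse_score)  # identical scores
```

Dropping the two points with αᵢ = 0 leaves the score unchanged, which is why a trained SVM only needs to keep its support vectors.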

Real data is rarely perfectly separable, which is where the soft-margin SVM comes in. Slack variables ξᵢ allow individual points to violate the margin or even cross the decision boundary, but each violation is penalized. The regularization parameter C controls the tradeoff: large C penalizes violations heavily (the model tries hard to classify everything correctly, risking overfitting); small C allows more violations in exchange for a wider margin (the model generalizes better but may misclassify some training points). Choosing C is one of the main hyperparameter decisions in SVM training.
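
The role of C can be illustrated with the standard soft-margin objective, ½‖w‖² + C Σᵢ ξᵢ, where ξᵢ = max(0, 1 − yᵢ(w·xᵢ + b)). The sketch below is not a solver: it only evaluates that objective for two hand-picked candidate boundaries on hypothetical 1-D data, and the function names are assumptions made for illustration.

```python
def slacks(w, b, points, labels):
    """xi_i = max(0, 1 - y_i * (w.x_i + b)); zero when the point respects the margin."""
    return [
        max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, y in zip(points, labels)
    ]

def soft_margin_objective(w, b, points, labels, C):
    """0.5 * ||w||^2 + C * sum_i xi_i."""
    return 0.5 * sum(wi * wi for wi in w) + C * sum(slacks(w, b, points, labels))

# Hypothetical 1-D data: the last positive point sits close to the negative class.
points = [(2.0,), (-2.0,), (-1.0,)]
labels = [1, -1, 1]

wide = ((0.5,), 0.0)    # wide margin, but the point at -1 violates it (slack 1.5)
narrow = ((2.0,), 3.0)  # classifies everything with zero slack, but a narrow margin

for C in (0.1, 10.0):
    wide_cost = soft_margin_objective(*wide, points, labels, C)
    narrow_cost = soft_margin_objective(*narrow, points, labels, C)
    print(C, "wide" if wide_cost < narrow_cost else "narrow")
```

With C = 0.1 the wide-margin candidate has the lower objective despite its violation; with C = 10 the penalty dominates and the narrow, violation-free candidate wins.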

The kernel trick extends SVMs to non-linear boundaries without explicitly constructing a high-dimensional feature space. In their dual form, the SVM optimization and prediction formulas depend on the data only through pairwise inner products. If you replace each inner product ⟨xᵢ, xⱼ⟩ with a kernel function k(xᵢ, xⱼ), which computes the inner product of the data in some (possibly infinite-dimensional) feature space, you get an SVM that finds non-linear boundaries in the original space. The RBF kernel k(x, x') = exp(−γ‖x − x'‖²) is the most common choice. It is highly flexible: with a large enough γ it can separate any finite set of distinct labeled points, so γ must be tuned together with C to avoid overfitting.
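
The identity behind the kernel trick can be checked numerically for a small case. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D inputs, where the explicit feature map is short enough to write out; the function names are illustrative assumptions. An RBF kernel is included for reference, although its feature space is infinite-dimensional and has no such finite map.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: k(x, z) = (x.z)^2."""
    return dot(x, z) ** 2

def poly2_features(x):
    """Explicit 3-D feature map for 2-D inputs matching poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def rbf_kernel(x, z, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

x, z = (1.0, 2.0), (3.0, -1.0)

implicit = poly2_kernel(x, z)                         # computed in the 2-D input space
explicit = dot(poly2_features(x), poly2_features(z))  # computed in the 3-D feature space

print(implicit, explicit)  # equal up to floating-point rounding
print(rbf_kernel(x, z, gamma=0.1))
```

The kernel returns the feature-space inner product while only touching the original coordinates, which is what lets an SVM work in that space without constructing it.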

SVMs were among the dominant classification methods before deep learning became practical. They remain valuable when data is high-dimensional relative to sample size (text classification, bioinformatics), when interpretability matters (support vectors have geometric meaning), and when you lack the labeled data to train deep networks. Understanding SVMs also gives you insight into the geometry of classification: the concept of margin, the duality between the primal and dual problems, and the kernel trick are ideas that reappear throughout machine learning theory.

Prerequisite Chain

Counting to 10 → Counting to 20 → Understanding Zero → The Number Zero → Counting to Five → One-to-One Correspondence → Combining Small Groups Within 5 → Addition Within 10 → Addition Within 20 → Two-Digit Addition Without Regrouping → Two-Digit Addition with Regrouping → Addition Within 100 → Repeated Addition as Multiplication → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Two-Digit by One-Digit Division → Division with Remainders → Remainders and Quotients in Division → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Reading and Writing Decimals → Comparing and Ordering Decimals → Adding and Subtracting Decimals → Multiplying Decimals → Dividing Decimals → Dividing Fractions → Mixed Number Arithmetic → Order of Operations → Integer Order of Operations → Variable Expressions → Combining Like Terms → One-Step Equations → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Literal Equations → Slope-Intercept Form → Point-Slope Form → Writing Linear Equations → Parallel and Perpendicular Line Slopes → Graphing Linear Equations → Piecewise Functions → One-Sided Limits → Continuity Definition → Limit Definition of the Derivative → Power Rule → Constant Multiple and Sum/Difference Rules → Product Rule → Chain Rule → Higher-Order Derivatives → Concavity and Inflection Points → Second Derivative Test → Curve Sketching → Optimization Problems → Critical Points of Multivariable Functions → Critical Points and Classification of Extrema → Second Partial Test for Local Extrema (Hessian) → The Hessian Matrix and Second Derivative Test → Unconstrained Optimization: Finding Extrema → Optimization in Multiple Variables → Support Vector Machines

Longest path: 73 steps · 492 total prerequisite topics

Prerequisites (8)

Leads To (2)