A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Support Vector Machines

Graduate · Depth 98 in the knowledge graph

18 topics build on this

694 prerequisites beneath it

Core Idea

SVMs find the separating hyperplane that maximizes the margin between classes. Soft-margin SVMs tolerate margin violations and misclassified points via slack variables. Kernels implicitly map inputs to high-dimensional feature spaces, enabling non-linear classification without ever computing the mapping explicitly.

Explainer

The intuition behind SVMs starts with a simple question: given two linearly separable classes, which separating hyperplane should you choose? Many hyperplanes separate the training data correctly, but some will fail on new points that differ only slightly from what was seen during training. SVMs resolve this ambiguity by choosing the hyperplane that maximizes the *margin*: the distance between the boundary and the nearest training points on each side. A wider margin makes the classifier more robust: a test point can lie further from the training points of its class before it is misclassified.
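To make the margin concrete, here is a minimal sketch that compares two separating hyperplanes on a toy 2-D dataset. The dataset, the two candidate hyperplanes, and the helper name `geometric_margin` are all hypothetical choices made for illustration. No optimization is run; the code only measures the margin of each candidate.

```python
import math

# Toy 2-D dataset (hypothetical): two linearly separable classes.
points = [((2.0, 0.0), +1), ((3.0, 1.0), +1), ((0.0, 0.0), -1), ((-1.0, 1.0), -1)]

def geometric_margin(w, b, data):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A positive value means every point lies on the correct side.
    """
    norm = math.hypot(*w)
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm for x, y in data)

# Two candidate hyperplanes; both classify every training point correctly.
candidates = {
    "vertical line x1 = 1": ((1.0, 0.0), -1.0),
    "tilted line x1 + x2 = 1.5": ((1.0, 1.0), -1.5),
}

margins = {name: geometric_margin(w, b, points) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)

for name, m in margins.items():
    print(f"{name}: margin = {m:.3f}")
print("wider margin:", best)
```

Both candidates have zero training error, but the tilted line passes much closer to the point (2, 0), so a small perturbation of that point would flip its label. An SVM would prefer the vertical line.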

In the hard-margin case, the training points that sit exactly on the margin boundaries are called *support vectors*, and they are the only points that determine the hyperplane. Every other point is irrelevant: you could remove it from the dataset and get the same model. (In the soft-margin SVM described next, points that violate the margin are support vectors as well.) This sparsity is both elegant and practical: at prediction time, you only need to store the support vectors and evaluate inner products (or kernel values) against them, not against the full training set.
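The sparsity described above can be sketched with the dual form of the decision function, f(x) = Σ αᵢ yᵢ ⟨xᵢ, x⟩ + b, where only support vectors have αᵢ > 0. The toy dataset, the coefficients `alpha`, and the offset `b` below are assumptions: they were worked out by hand for this particular four-point set, not produced by a solver.

```python
# Toy dataset (hypothetical): four points, two per class.
X = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
y = [+1, +1, -1, -1]

# Dual coefficients derived by hand for this toy set (an assumption of the
# example, not solver output). A non-zero alpha marks a support vector.
alpha = [0.5, 0.0, 0.5, 0.0]
b = -1.0

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Keep only the support vectors: points with alpha > 0.
support = [(xi, yi, ai) for xi, yi, ai in zip(X, y, alpha) if ai > 0]

def decision(x, sv=support):
    """f(x) = sum_i alpha_i * y_i * <x_i, x> + b, summed over support vectors only."""
    return sum(ai * yi * dot(xi, x) for xi, yi, ai in sv) + b

scores = [decision(xi) for xi in X]
print("support vectors:", [xi for xi, _, _ in support])
print("decision values:", scores)
```

The two support vectors score exactly +1 and -1 (they sit on the margin boundaries), while the other two points score +2 and -2. Dropping those two points from `X` would leave `support`, and therefore every prediction, unchanged.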

Real data is rarely perfectly separable, which is where the soft-margin SVM comes in. Slack variables ξᵢ allow individual points to violate the margin or even cross the decision boundary, but each violation is penalized. The regularization parameter C controls the tradeoff: large C penalizes violations heavily (the model tries hard to classify everything correctly, risking overfitting); small C allows more violations in exchange for a wider margin (the model often generalizes better but may misclassify some training points, and too small a C underfits). Choosing C is one of the main hyperparameter decisions in SVM training.
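Written out, the tradeoff above is the standard soft-margin primal problem. The symbols w (normal vector), b (offset), yᵢ ∈ {−1, +1} (labels), and n (number of training points) are notation introduced here, not taken from the text above.

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\qquad\text{subject to}\qquad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n.
```

The first term widens the margin, which is 2/‖w‖, and the second term charges C per unit of slack. A point with ξᵢ = 0 respects the margin, 0 < ξᵢ ≤ 1 means it is inside the margin but still on the correct side, and ξᵢ > 1 means it is misclassified. Forcing every ξᵢ to zero recovers the hard-margin problem.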

The kernel trick extends SVMs to non-linear boundaries without explicitly constructing a high-dimensional feature space. The SVM dual optimization problem and the prediction formula depend on the data only through pairwise inner products. If you replace each inner product ⟨xᵢ, xⱼ⟩ with a kernel function k(xᵢ, xⱼ), which computes the inner product of the mapped data in some (possibly infinite-dimensional) feature space, you get an SVM that is linear in that feature space but non-linear in the original space. The RBF kernel k(x, x') = exp(−γ‖x − x'‖²) is a common default. It is flexible enough to fit any labeling of distinct training points (given a large enough C), so γ and C have to be tuned together to avoid overfitting.
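A small numeric check shows what "without explicitly constructing the feature space" means. The sketch below uses a degree-2 polynomial kernel, chosen here only because its feature map is small enough to write down. The function names, the test vectors, and the value γ = 0.5 are illustrative assumptions.

```python
import math

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = <x, z>^2 on 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit 3-D feature map whose inner product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - z||^2); gamma=0.5 is an arbitrary choice."""
    sq_dist = sum((a - c) ** 2 for a, c in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = (1.0, 2.0), (3.0, -1.0)

implicit = poly2_kernel(x, z)                           # stays in 2-D
explicit = sum(p * q for p, q in zip(phi(x), phi(z)))   # goes through 3-D

print("kernel value:          ", implicit)
print("explicit inner product:", explicit)
print("RBF kernel value:      ", rbf_kernel(x, z))
```

The two polynomial values agree up to floating-point rounding, yet `poly2_kernel` never builds the 3-D vectors. For the RBF kernel no finite `phi` exists, so the kernel function is the only practical way to compute those inner products.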

SVMs were among the dominant classification methods before deep learning became practical. They remain valuable when data is high-dimensional relative to sample size (text classification, bioinformatics), when interpretability matters (support vectors have geometric meaning), and when you lack the labeled data to train deep networks. Understanding SVMs also gives you insight into the geometry of classification: the concept of margin, the duality between the primal and dual problems, and the kernel trick are ideas that reappear throughout machine learning theory.

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Linear Regression in Machine Learning → Neural Network Fundamentals → Backpropagation Algorithm → Multilayer Perceptrons (MLPs) → Activation Functions in Neural Networks → Vanishing Gradient Problem → Gradient Descent and Optimization → Gradient Boosting Machines → Support Vector Machines

Longest path: 99 steps · 694 total prerequisite topics

Prerequisites (8)

Optimization Problems (hard)
Dot Product (Inner Product in R^n) (soft)
Vector Spaces (soft)
Constrained Optimization Applications (soft)
Inner Product Spaces (soft)
Optimization in Multiple Variables (soft)
Matrix Operations (soft)
Gradient Boosting Machines (soft)

Leads To (2)

Kernel Methods and the Kernel Trick (hard)
Support Vector Regression (hard)