SVMs find the hyperplane that maximizes the margin between classes. Soft-margin SVMs tolerate margin violations and misclassifications via slack variables. Kernels implicitly map data to high-dimensional spaces, enabling non-linear classification without computing the mapping explicitly.
The intuition behind SVMs starts with a simple question: given two linearly separable classes, which separating hyperplane should you choose? Many hyperplanes separate the training data correctly, but some pass so close to the training points that they will fail on new points that differ only slightly from what was seen during training. SVMs resolve this ambiguity by choosing the hyperplane that maximizes the *margin*: the distance between the boundary and the nearest training points on each side. A wider margin means the classifier is more robust: a test point can lie farther from the training points of its class before it crosses the boundary and is misclassified.
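To make the margin concrete, here is a minimal sketch that measures it for two hand-picked hyperplanes on a toy dataset. The function name `geometric_margin`, the four points, and both candidate weight vectors are assumptions chosen for illustration. Nothing here trains an SVM; the code only compares candidates.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A positive result means every point is on the correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical, linearly separable toy data.
points = [(1.0, 1.0), (2.0, 0.0), (-1.0, -1.0), (0.0, -2.0)]
labels = [1, 1, -1, -1]

# Two hyperplanes through the origin; both classify all four points correctly.
margin_diag = geometric_margin((1.0, 1.0), 0.0, points, labels)
margin_tilted = geometric_margin((1.0, 0.2), 0.0, points, labels)

print(f"diagonal boundary margin: {margin_diag:.3f}")
print(f"tilted boundary margin:   {margin_tilted:.3f}")
```

Both candidates separate the data, but the tilted one passes much closer to one of the negative points, so an SVM would prefer the diagonal one.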
The training points that sit exactly on the margin boundaries are called *support vectors*, and in the hard-margin case they are the only points that determine the hyperplane. Every other point is irrelevant: you could remove it from the dataset and get the same model. (In the soft-margin setting, points that violate the margin are support vectors as well.) This sparsity is both elegant and practical: at prediction time, you only need to store the support vectors and evaluate inner products against them, not against the full training set.
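The sparsity can be seen in the dual form of the solution, where the weight vector is a weighted sum w = Σ αᵢ yᵢ xᵢ and every point off the margin has αᵢ = 0. The sketch below uses a toy problem small enough to solve by hand: the closest pair of opposite-class points is (1, 1) and (−1, −1), which gives w = (0.5, 0.5) and b = 0. The data, the `alphas` values, and the helper names are illustrative assumptions; the coefficients are hand-derived, not produced by a solver.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy data: only the first two points lie on the margin.
points = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0), (-2.0, -4.0)]
labels = [1, -1, 1, -1]

# Hand-derived dual coefficients for this toy problem (assumed, not solved here).
alphas = [0.25, 0.25, 0.0, 0.0]
bias = 0.0

def decision(x, indices):
    """f(x) = sum_i alpha_i * y_i * <x_i, x> + b over the given training indices."""
    return sum(alphas[i] * labels[i] * dot(points[i], x) for i in indices) + bias

all_idx = list(range(len(points)))
support_idx = [i for i in all_idx if alphas[i] > 0]

# y_i * f(x_i) equals 1 on the margin and exceeds 1 for every other point.
functional_margins = [labels[i] * decision(points[i], all_idx) for i in all_idx]

query = (0.5, -2.0)
print("support vector indices:", support_idx)
print("functional margins:", functional_margins)
print("f(query), all points:      ", decision(query, all_idx))
print("f(query), support vectors: ", decision(query, support_idx))
```

Dropping the two points with αᵢ = 0 leaves the decision function unchanged, which is why only the support vectors need to be kept for prediction.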
Real data is rarely perfectly separable, which is where the soft-margin SVM comes in. Slack variables ξᵢ allow individual points to violate the margin or even cross the decision boundary, but each violation is penalized in proportion to its size. The regularization parameter C controls the tradeoff: large C penalizes violations heavily (the model tries hard to classify every training point correctly, risking overfitting); small C allows more violations in exchange for a wider margin (the model often generalizes better, but may misclassify some training points and can underfit if C is too small). Choosing C, typically by cross-validation, is one of the main hyperparameter decisions in SVM training.
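The effect of C can be illustrated with the standard soft-margin objective ½‖w‖² + C Σ ξᵢ, where ξᵢ = max(0, 1 − yᵢ(w·xᵢ + b)). The sketch below evaluates that objective for two hand-picked 1-D classifiers rather than running an optimizer. The data (which includes one positive outlier near the negative class), the two candidates, and the function names are all assumptions made for illustration.

```python
def soft_margin_objective(w, b, xs, ys, C):
    """0.5 * w^2 + C * (sum of slacks) for 1-D inputs."""
    slacks = [max(0.0, 1.0 - y * (w * x + b)) for x, y in zip(xs, ys)]
    return 0.5 * w * w + C * sum(slacks)

# Hypothetical 1-D data; x = 0.2 is a positive outlier close to the negatives.
xs = [2.0, 3.0, 0.2, -2.0, -3.0]
ys = [1, 1, 1, -1, -1]

# Candidate "wide": margin half-width 1/|w| = 2, but the outlier violates it.
# Candidate "tight": no violations, but the margin half-width shrinks to 1.1.
candidates = {
    "wide": (0.5, 0.0),
    "tight": (10.0 / 11.0, 9.0 / 11.0),
}

def preferred(C):
    """Name of the candidate with the lower objective value for this C."""
    return min(
        candidates,
        key=lambda name: soft_margin_objective(*candidates[name], xs, ys, C),
    )

for C in (0.1, 10.0):
    print(f"C = {C}: lower objective for the '{preferred(C)}' candidate")
```

With a small C the objective favors the wide margin and absorbs the outlier as slack; with a large C the penalty dominates and the tighter boundary that fits every point wins.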
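The kernel trick extends SVMs to non-linear boundaries without explicitly constructing a high-dimensional feature space. In the dual formulation, the SVM optimization and prediction formulas depend on the data only through pairwise inner products. If you replace each inner product ⟨xᵢ, xⱼ⟩ with a kernel function k(xᵢ, xⱼ), that is, a symmetric positive semi-definite function that equals the inner product of the mapped data in some (possibly infinite-dimensional) feature space, you get an SVM that is still linear in that feature space but finds non-linear boundaries in the original space. The RBF kernel k(x, x') = exp(−γ‖x − x'‖²) is the most common choice. It is a universal kernel, so with suitable γ and C it can represent very flexible decision boundaries, but γ has to be tuned: a very large γ lets the model fit individual training points, while a very small γ yields an overly smooth, almost linear boundary.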
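A small check makes the kernel identity tangible. For 2-D inputs, the homogeneous degree-2 polynomial kernel k(x, z) = ⟨x, z⟩² equals an ordinary inner product after the explicit map φ(x) = (x₁², √2·x₁x₂, x₂²). The sketch below verifies this numerically and also defines the RBF kernel from the text; the function names and the default `gamma` value are assumptions for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return dot(x, z) ** 2

def poly2_features(x):
    """Explicit feature map whose inner product reproduces poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def rbf_kernel(x, z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2); its feature space is infinite-dimensional."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = (1.0, 2.0), (3.0, -1.0)

implicit = poly2_kernel(x, z)                          # never builds phi
explicit = dot(poly2_features(x), poly2_features(z))   # builds phi, then takes the dot product

print("kernel value:     ", implicit)
print("explicit features:", explicit)
print("rbf(x, z):        ", rbf_kernel(x, z))
```

The two polynomial values agree up to floating-point rounding. For the RBF kernel no finite explicit map exists, so the kernel form is the only practical way to compute it.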
SVMs were among the dominant classification methods before deep learning became practical. They remain valuable when data is high-dimensional relative to sample size (text classification, bioinformatics), when some interpretability matters (support vectors have geometric meaning, and the weights of a linear SVM can be inspected directly), and when you lack the labeled data to train deep networks. Understanding SVMs also gives you insight into the geometry of classification: the concept of margin, the duality between the primal and dual problems, and the kernel trick are ideas that reappear throughout machine learning theory.