In a trained hard-margin SVM, which training points directly determine the decision boundary?
A. All training points on the correct side of the boundary
B. Only the misclassified training points
C. The training points lying exactly on the margin boundaries (support vectors)
D. A random subset of training points selected during optimization
Answer: C
The SVM hyperplane is fully determined by the support vectors: the training points closest to the decision boundary, lying exactly on the margin edges. In a hard-margin SVM every training point is correctly classified, so all of the remaining points can be removed from the training set without changing the learned hyperplane. This is why SVMs can be memory-efficient at test time: only the support vectors and their weights need to be stored.
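The claim that non-support points can be dropped is easy to check in one dimension, where the hard-margin boundary has a closed form: the threshold sits halfway between the closest pair of opposite-class points. The sketch below assumes separable data with every -1 point to the left of every +1 point; the function name `max_margin_threshold` and the toy dataset are made up for illustration.

```python
def max_margin_threshold(points):
    """Hard-margin boundary for separable 1D data given as (x, label) pairs.

    Assumes labels are -1 or +1 and every -1 point lies left of every +1 point.
    Returns (threshold, margin, support_vectors).
    """
    left = max(x for x, y in points if y == -1)   # negative point closest to the boundary
    right = min(x for x, y in points if y == +1)  # positive point closest to the boundary
    if left >= right:
        raise ValueError("data is not separable in this orientation")
    threshold = (left + right) / 2
    margin = (right - left) / 2
    return threshold, margin, [left, right]


# Toy training set (illustrative values only).
train = [(-5.0, -1), (-3.0, -1), (-1.0, -1), (2.0, 1), (4.0, 1), (7.0, 1)]
full_fit = max_margin_threshold(train)

# Keep only the two support vectors and refit.
support_only = [(x, y) for x, y in train if x in full_fit[2]]
reduced_fit = max_margin_threshold(support_only)

print(full_fit)
print(reduced_fit)
```

Both fits report the same threshold and margin, even though the second one saw only two of the six points. Moving or deleting any of the other four points (while keeping them outside the margin) would leave the result unchanged as well.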
Question 2 True / False
In a hard-margin SVM, maximizing the margin directly reduces training error.
T. True
F. False
Answer: False
Hard-margin SVMs require every training point to be correctly classified by definition, so training error is zero regardless of margin width (if the data is not linearly separable, no hard-margin solution exists at all). The margin is maximized subject to this zero-error constraint. The margin does not measure how well the model fits the training data; it measures how robust the boundary is to small perturbations, which relates to generalization (test error), not training error. Confusing margin size with training error is a common misconception.
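To see that margin width and training error are separate quantities, the sketch below scores three candidate thresholds on a small separable 1D dataset. The helper names `training_error` and `margin` and the data values are assumptions chosen for illustration; the classifier is the simple rule "predict +1 if x is above the threshold".

```python
# Toy separable 1D dataset of (x, label) pairs (illustrative values only).
train = [(-5.0, -1), (-3.0, -1), (-1.0, -1), (2.0, 1), (4.0, 1), (7.0, 1)]


def training_error(threshold, points):
    """Fraction of points misclassified by the rule: predict +1 if x > threshold."""
    wrong = sum(1 for x, y in points if (1 if x > threshold else -1) != y)
    return wrong / len(points)


def margin(threshold, points):
    """Distance from the threshold to the nearest training point."""
    return min(abs(x - threshold) for x, _ in points)


# Three boundaries that all separate the data, with very different margins.
for t in (-0.9, 0.5, 1.9):
    print(f"threshold={t:+.1f}  error={training_error(t, train):.2f}  "
          f"margin={margin(t, train):.2f}")
```

All three thresholds classify every training point correctly, so their training error is identical. Only the middle one has a wide margin, which is the boundary a hard-margin SVM would pick.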
Question 3 Short Answer
Why can't a standard linear SVM classify XOR-distributed data, and how does the kernel trick address this limitation?
Model answer: XOR data is not linearly separable: no hyperplane (a straight line, in the original 2D space) correctly separates the two classes. With a suitable kernel, the kernel trick implicitly maps the data to a higher-dimensional feature space where a separating hyperplane exists. By replacing inner products in the SVM dual formulation with a kernel function k(x, x') = φ(x)·φ(x'), the algorithm finds a non-linear decision boundary in the original space without ever explicitly computing the high-dimensional mapping φ.
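As a concrete instance of the model answer, the sketch below uses the homogeneous degree-2 polynomial kernel k(a, b) = (a·b)^2, whose explicit feature map in 2D is φ(x) = (x1^2, √2·x1·x2, x2^2). It checks numerically that the kernel equals the inner product in feature space, and that the XOR points become separable there. The weight vector `w` is hand-picked for illustration, not learned by an SVM solver, and all names are hypothetical.

```python
import math

# XOR-style data: label +1 when the two coordinates differ, -1 when they agree.
X = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
y = [-1, 1, 1, -1]


def dot(a, b):
    """Plain inner product of two equal-length tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(a, b):
    """k(a, b) = (a . b)^2, computed in the original 2D space without building phi."""
    return dot(a, b) ** 2


# The kernel value matches the feature-space inner product (up to rounding).
kernel_matches = all(
    math.isclose(poly_kernel(a, b), dot(phi(a), phi(b)), abs_tol=1e-9)
    for a in X
    for b in X
)

# In feature space, a hyperplane through the origin separates the two classes.
w = (0.0, -1.0, 0.0)  # hand-picked for illustration, not learned
predictions = [1 if dot(w, phi(x)) > 0 else -1 for x in X]

print(kernel_matches)
print(predictions == y)
```

The separating direction only uses the middle feature, the product x1·x2, which is exactly the interaction a straight line in the original 2D space cannot express.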
The kernel trick works because the SVM dual optimization and the prediction rule both involve only inner products between data points, never their explicit coordinates. Substituting a kernel function for these inner products is mathematically valid whenever the kernel is positive semidefinite (Mercer's condition), and it can implicitly operate in infinite-dimensional feature spaces (as with the RBF kernel), making non-linear classification computationally feasible.
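The prediction rule described above can be sketched as f(x) = Σ α_i y_i k(x_i, x) + b, shown here with an RBF kernel on the four XOR points. The dual weights are not produced by a solver: they are set equal by symmetry and scaled so that y_i·f(x_i) = 1 at every training point, which is an assumption specific to this toy dataset. The value of `GAMMA` and all function names are likewise illustrative.

```python
import math

# XOR-style data: label +1 when the two coordinates differ, -1 when they agree.
X = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
y = [-1, 1, 1, -1]

GAMMA = 0.5  # RBF width parameter, chosen arbitrarily for this sketch


def rbf_kernel(a, b, gamma=GAMMA):
    """k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


# Equal dual weights, chosen by symmetry (not by an optimizer) so that
# y_i * f(x_i) = 1 at each of the four points; the bias is zero by symmetry.
alpha = 1.0 / (1.0 - math.exp(-4.0 * GAMMA)) ** 2
alphas = [alpha] * len(X)
b = 0.0


def decision_function(x):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) + b; data enters only through k."""
    return sum(a * yi * rbf_kernel(xi, x) for a, yi, xi in zip(alphas, y, X)) + b


scores = [decision_function(x) for x in X]
predictions = [1 if s > 0 else -1 for s in scores]

print(predictions == y)
```

Nothing in `decision_function` ever constructs a feature vector: swapping `rbf_kernel` for another valid kernel changes the shape of the decision boundary without touching the rest of the code.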