A machine learning engineer applies an RBF kernel SVM to a dataset that is not linearly separable in 2D. What actually happens during training that allows the SVM to find a non-linear decision boundary?
AThe data points are explicitly transformed to an infinite-dimensional space, and a standard linear SVM is trained there
BThe SVM retrains multiple times with different hyperplane orientations until a curved boundary is discovered
CEvery dot product in the SVM optimization is replaced by a kernel evaluation k(x_i, x_j), so the algorithm behaves as if it is operating in a high-dimensional space while computing only in the original space
DThe RBF kernel compresses the data into a lower-dimensional space where classes become linearly separable
The kernel trick is about *implicit* computation. The SVM formulation only requires dot products between pairs of training points, never the coordinates themselves. Replacing each dot product with k(x_i, x_j) makes the algorithm behave as if it mapped the data to the high-dimensional (or infinite-dimensional) feature space — but the explicit mapping φ is never computed. Option A is the expensive approach the kernel trick was invented to avoid. Option D reverses the geometry: the kernel maps to higher dimensions, not lower.
Question 2 Multiple Choice
Why does the polynomial kernel k(x, y) = (x·y + 1)² provide a computational advantage over explicitly mapping data points to the polynomial feature space before computing dot products?
AThe polynomial kernel produces more accurate results than explicit feature mapping because it avoids rounding errors
BThe kernel function computes exactly the same value as the dot product in the expanded feature space, but using only the original coordinates — avoiding the cost of constructing and storing the high-dimensional feature vectors
CThe polynomial kernel reduces the data's dimensionality as a form of implicit regularization
DThe kernel function uses matrix decomposition to skip the dot product computation entirely
The mathematical output is identical: k(x, y) = φ(x)·φ(y) exactly. The kernel trick is not an approximation — it produces the exact same result as the explicit expansion. The advantage is purely computational: instead of expanding a 2D point to a 6D (or higher) feature vector and then taking a dot product, you compute a single scalar from the original 2D coordinates. As feature space dimensionality grows (especially to infinity in the RBF case), this shortcut becomes essential.
Question 3 True / False
The kernel trick can be applied to any machine learning algorithm to make it work in high-dimensional feature spaces.
TTrue
FFalse
Answer: False
The kernel trick only works for algorithms that can be expressed entirely in terms of dot products between data points — this is called the 'kernel-compatible' or 'kernelizable' formulation. SVMs qualify because their dual optimization depends only on inner products. Algorithms that require explicit feature coordinates (such as naive Bayes or most tree-based methods) cannot benefit from the kernel trick. The requirement for a dot-product-based formulation is a precondition, not a universal property of all learning algorithms.
Question 4 True / False
Increasing the γ (gamma) parameter of an RBF kernel in an SVM generally produces a smoother, simpler decision boundary.
TTrue
FFalse
Answer: False
Larger γ means each training point's 'influence' drops off rapidly with distance — its kernel value approaches zero for points more than a short distance away. This makes the model highly sensitive to local training data, producing a complex, wiggly decision boundary that closely wraps around training points. This typically leads to overfitting. Smaller γ spreads each point's influence broadly, producing smoother and more global decision boundaries with better generalization. The effect is the opposite of what many students intuitively expect.
Question 5 Short Answer
Explain what it means for a kernel function to 'implicitly compute a dot product in a high-dimensional feature space,' and why this matters for learning non-linear decision boundaries.
Think about your answer, then reveal below.
Model answer: A kernel function k(x, y) is mathematically equal to φ(x)·φ(y), where φ maps data points to a (possibly infinite-dimensional) feature space. The kernel computes this inner product value using only the original coordinates of x and y, without ever constructing the feature vectors φ(x) and φ(y). This matters because the SVM optimization problem is expressed entirely in terms of these inner products, so substituting kernel evaluations for dot products makes the SVM behave as though it found a linear hyperplane in the high-dimensional feature space — which, when viewed back in the original space, corresponds to a non-linear decision boundary. The implicit nature of the mapping keeps computation feasible even when the feature space is infinite-dimensional.
The 'trick' is that the algorithm never needs to know φ explicitly — only k. This decouples the expressive power of the feature space (determined by kernel choice) from the computational cost (always determined by operations in the original input space).