In the Neural Tangent Kernel limit (infinite network width), what happens to the learned representations of neurons during training?
A. Representations continuously change and adapt to the data, allowing different layers to specialize
B. Representations are frozen after initialization; the network learns through kernel-based prediction without representation change
C. Representations collapse to a single vector, forcing all neurons to learn identical features
D. Representations change randomly, making learning unpredictable
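A surprising insight of NTK theory is that in the infinite-width limit, hidden representations stay essentially frozen at their random initialization. Every parameter moves only infinitesimally during training, so the network behaves like its first-order Taylor expansion around the initial weights, and gradient descent on that linearized model is equivalent to kernel regression. The kernel matrix K_ij = <∇_θ f(x_i), ∇_θ f(x_j)>, the inner product of the parameter gradients of the network output at inputs x_i and x_j, becomes deterministic at initialization and stays constant throughout training (for standard fully connected networks, e.g. with sigmoid or ReLU activations, under the NTK parameterization). This is why NTK predictions are so accurate for very wide networks: such networks implicitly solve a kernel regression problem with a fixed kernel that does not adapt to the labels.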
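As a rough illustration of a kernel that is determined at initialization, the sketch below computes the empirical NTK of a toy two-layer ReLU network and compares it with its infinite-width limit. The architecture f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x), the standard normal initialization, and every function name are assumptions chosen for this example. The closed form relies on the standard arc-cosine kernel expressions and is only valid for unit-norm inputs.

```python
import math
import random

def empirical_ntk(x1, x2, W, a):
    """Empirical NTK of f(x) = (1/sqrt(m)) * sum_j a[j] * relu(W[j] . x)."""
    m = len(a)
    dot_xx = sum(u * v for u, v in zip(x1, x2))
    total = 0.0
    for w_j, a_j in zip(W, a):
        p1 = sum(wi * xi for wi, xi in zip(w_j, x1))
        p2 = sum(wi * xi for wi, xi in zip(w_j, x2))
        # Gradients w.r.t. the output weight a_j contribute relu(p1) * relu(p2).
        total += max(p1, 0.0) * max(p2, 0.0)
        # Gradients w.r.t. the hidden weights w_j contribute
        # a_j^2 * 1[p1 > 0] * 1[p2 > 0] * <x1, x2>.
        if p1 > 0 and p2 > 0:
            total += a_j * a_j * dot_xx
    return total / m

def analytic_ntk(x1, x2):
    """Infinite-width limit of the kernel above (unit-norm inputs assumed)."""
    cos_t = max(-1.0, min(1.0, sum(u * v for u, v in zip(x1, x2))))
    theta = math.acos(cos_t)
    k_output = (math.sin(theta) + (math.pi - theta) * cos_t) / (2 * math.pi)
    k_hidden = cos_t * (math.pi - theta) / (2 * math.pi)
    return k_output + k_hidden

def random_network(m, d, rng):
    """Draw hidden weights W (m x d) and output weights a (m) from N(0, 1)."""
    W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
    a = [rng.gauss(0, 1) for _ in range(m)]
    return W, a

rng = random.Random(0)
x1 = [1.0, 0.0, 0.0]
x2 = [0.6, 0.8, 0.0]
for m in (10, 1000, 20000):
    W, a = random_network(m, 3, rng)
    print(m, round(empirical_ntk(x1, x2, W, a), 3), round(analytic_ntk(x1, x2), 3))
```

With small m the empirical value depends noticeably on the random draw; as m grows it should settle near the analytic value, which is the sense in which the kernel becomes deterministic at initialization.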
Question 2 Short Answer
Why is the Neural Tangent Kernel relevant for understanding finite-width neural networks?
Model answer: Finite-width networks deviate from pure NTK behavior, but the NTK is a good approximation when the width is sufficiently large. The size of the deviation depends on how far the hidden features move during training relative to their scale at initialization: in very wide networks this movement is negligible and NTK behavior dominates. NTK theory helps explain why wide networks can generalize well despite perfectly interpolating the training data, and why depth matters even in the NTK regime (deeper networks induce different kernels). For practical networks of moderate width, the NTK is an approximation that becomes increasingly accurate as width increases.
NTK serves as an important theoretical limit and practical diagnostic tool. When your neural network is wide enough that NTK theory applies, you can predict generalization using kernel methods and RKHS theory. When NTK breaks down (e.g., small networks, deep feature learning), other phenomena like double descent become relevant. This layering of theory allows precise understanding of when different learning mechanisms dominate.
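To make the width dependence concrete, here is a minimal sketch that takes a single gradient step on one training example and measures how far the hidden weights move relative to their initial norm. The toy two-layer ReLU model, the 1/sqrt(m) output scaling, the learning rate, the target value, and the helper name `one_step_stats` are all assumptions made for illustration.

```python
import math
import random

def one_step_stats(m, d=3, lr=1.0, y=2.0, seed=0):
    """One gradient step on a single example for a width-m toy network.

    Model: f(x) = (1/sqrt(m)) * sum_j a[j] * relu(W[j] . x), loss = 0.5 * (f - y)^2.
    Returns (relative change of W, |residual| before, |residual| after).
    """
    rng = random.Random(seed)
    x = [1.0] + [0.0] * (d - 1)  # unit-norm input
    W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
    a = [rng.gauss(0, 1) for _ in range(m)]
    scale = 1.0 / math.sqrt(m)

    def forward(W_, a_):
        pre = [sum(wi * xi for wi, xi in zip(w_j, x)) for w_j in W_]
        out = scale * sum(a_j * max(p, 0.0) for a_j, p in zip(a_, pre))
        return pre, out

    pre, f0 = forward(W, a)
    r = f0 - y  # residual
    # dL/da_j = r * scale * relu(p_j)
    new_a = [a_j - lr * r * scale * max(p, 0.0) for a_j, p in zip(a, pre)]
    # dL/dw_j = r * scale * a_j * 1[p_j > 0] * x
    new_W = [
        [wi - lr * r * scale * a_j * (1.0 if p > 0 else 0.0) * xi
         for wi, xi in zip(w_j, x)]
        for w_j, a_j, p in zip(W, a, pre)
    ]
    _, f1 = forward(new_W, new_a)

    moved = math.sqrt(sum((u - v) ** 2
                          for w_n, w_o in zip(new_W, W)
                          for u, v in zip(w_n, w_o)))
    norm = math.sqrt(sum(v * v for w_o in W for v in w_o))
    return moved / norm, abs(f0 - y), abs(f1 - y)

for m in (10, 100, 1000, 10000):
    rel, before, after = one_step_stats(m)
    print(m, round(rel, 4), round(before, 3), round(after, 3))
```

In this setup the relative weight change is expected to shrink roughly like 1/sqrt(m), while the output can still move by an order-one amount. That combination, a large change in the function with a tiny change in the features, is what the lazy (near-NTK) regime refers to.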
Question 3 True / False
The Neural Tangent Kernel is fixed at initialization in the infinite-width limit and does not adapt to the training data. True or false: this makes the kernel useless for learning.
T. True
F. False
Answer: False
Even though the kernel K is fixed (data-independent), the regression problem on top of it uses the labeled data to fit the coefficients. The kernel's fixed structure still encodes inductive biases (e.g., smoothness, hierarchical feature extraction at different depths) that enable generalization. The NTK's data-independence is actually an advantage: you can compute the kernel matrix K once for a set of inputs and reuse it to analyze learnability for any target function, which makes learning tractable to study in the infinite-width regime.
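The following sketch illustrates this division of labor: the kernel matrix is built once from the inputs, and only the coefficients change when the labels change. It assumes the closed-form two-layer ReLU NTK for unit-norm inputs, a small ridge term added purely for numerical stability, and made-up inputs and labels; the helper names `ntk`, `solve`, and `predict` are hypothetical.

```python
import math

def ntk(x1, x2):
    """Closed-form two-layer ReLU NTK for unit-norm inputs (assumed setup)."""
    cos_t = max(-1.0, min(1.0, sum(u * v for u, v in zip(x1, x2))))
    theta = math.acos(cos_t)
    k_output = (math.sin(theta) + (math.pi - theta) * cos_t) / (2 * math.pi)
    k_hidden = cos_t * (math.pi - theta) / (2 * math.pi)
    return k_output + k_hidden

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

# Training inputs on the unit circle; the kernel matrix depends only on these.
angles = [0.0, 0.7, 1.4, 2.1, 2.8]
X = [[math.cos(t), math.sin(t)] for t in angles]
ridge = 1e-3  # assumed stabilizer, not part of the NTK itself
K_reg = [[ntk(xi, xj) + (ridge if i == j else 0.0)
          for j, xj in enumerate(X)]
         for i, xi in enumerate(X)]

# Two different labelings reuse the same kernel matrix.
y_a = [1.0, -1.0, 1.0, -1.0, 1.0]
y_b = [0.0, 0.5, 1.0, 0.5, 0.0]
alpha_a = solve(K_reg, y_a)
alpha_b = solve(K_reg, y_b)

def predict(x_new, alpha):
    """Kernel regression prediction: sum_i alpha_i * K(x_new, x_i)."""
    return sum(a_i * ntk(x_new, x_i) for a_i, x_i in zip(alpha, X))

x_test = [math.cos(1.0), math.sin(1.0)]
print(predict(x_test, alpha_a), predict(x_test, alpha_b))
```

Nothing about the kernel changes between the two fits; the labels enter only through the coefficient vectors `alpha_a` and `alpha_b`.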
Question 4 Multiple Choice
Compare NTK theory to feature learning in finite-width networks. Which statement is most accurate?
A. NTK and feature learning are orthogonal; networks either exhibit one or the other
B. NTK is a special case where feature learning is zero; finite networks interpolate between NTK (no learning) and full feature learning
C. All neural networks follow NTK dynamics exactly; claims of feature learning are misconceptions
D. Feature learning and NTK coexist at different scales: NTK captures global optimization dynamics, feature learning captures representation changes
Modern understanding distinguishes lazy training (near-NTK regime, small learning rates, wide networks) from feature learning (moderate learning rates, reasonable width, representation evolution). These are not separate regimes but coexist: NTK theory accurately predicts loss trajectories while feature learning describes representation geometry. This multi-scale view resolves the apparent tension between NTK's frozen features and empirical observation of representation learning.