Questions — Knowledge Distillation

Question 1 Multiple Choice

A student model is trained twice: once on hard labels (one-hot vectors) and once on soft targets from a large teacher model, using the same architecture and training set. Why does the soft-target student typically generalize better?

A. Soft targets provide gradient signal for every output class on every example, teaching inter-class similarity rather than just the correct answer

B. Soft targets reduce the learning rate implicitly, preventing the student from overfitting

C. The teacher model acts as data augmentation by generating additional training samples

D. Soft targets allow the student to match the teacher's weight values directly, bypassing normal gradient descent

Question 2 Multiple Choice

What is the effect of increasing the temperature parameter T applied to the teacher's softmax during distillation?

A. It sharpens the teacher's distribution, making the correct class probability approach 1.0

B. It flattens the teacher's distribution, making inter-class probability differences more visible to the student

C. It increases the weight of the hard label loss relative to the soft target loss

D. It reduces the number of training epochs needed to achieve convergence

Question 3 True / False

Raising the temperature parameter in distillation makes the teacher's output distribution more peaked, concentrating probability on the correct class.

T. True

F. False

Question 4 True / False

The 'dark knowledge' transferred during distillation refers to information about which incorrect classes the teacher considers plausible — information that a hard one-hot label cannot convey.

T. True

F. False

Question 5 Short Answer

Why do soft targets from a teacher model provide richer supervision than hard labels, even when both identify the same correct class for every training example?

Think about your answer, then reveal below.
