Questions: Knowledge Distillation

5 questions to test your understanding

Question 1 Multiple Choice

A student model is trained twice: once on hard labels (one-hot vectors) and once on soft targets from a large teacher model, using the same architecture and training set. Why does the soft-target student typically generalize better?

A. Soft targets provide gradient signal for every output class on every example, teaching inter-class similarity rather than just the correct answer
B. Soft targets reduce the learning rate implicitly, preventing the student from overfitting
C. The teacher model acts as data augmentation by generating additional training samples
D. Soft targets allow the student to match the teacher's weight values directly, bypassing normal gradient descent
Question 2 Multiple Choice

What is the effect of increasing the temperature parameter T applied to the teacher's softmax during distillation?

A. It sharpens the teacher's distribution, making the correct class probability approach 1.0
B. It flattens the teacher's distribution, making inter-class probability differences more visible to the student
C. It increases the weight of the hard label loss relative to the soft target loss
D. It reduces the number of training epochs needed to achieve convergence
Question 3 True / False

Raising the temperature parameter in distillation makes the teacher's output distribution more peaked, concentrating probability on the correct class.

T. True
F. False
Question 4 True / False

The 'dark knowledge' transferred during distillation refers to information about which incorrect classes the teacher considers plausible — information that a hard one-hot label cannot convey.

T. True
F. False
Question 5 Short Answer

Why do soft targets from a teacher model provide richer supervision than hard labels, even when both identify the same correct class for every training example?

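Once you have attempted the questions, one way to check your intuition about temperature is to compute a few distributions by hand. The sketch below is only an illustration, not part of the quiz: the function names and the teacher logits are hypothetical, and it assumes the common formulation in which the logits are divided by T before the softmax. It prints the resulting probabilities and their entropy for several temperatures so you can compare them.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical teacher logits for three classes (illustrative values only).
teacher_logits = [8.0, 4.0, 1.0]

# A hard label for the same example carries no information about the other classes.
hard_label = [1.0, 0.0, 0.0]
print("hard label entropy:", entropy(hard_label))

for t in (1.0, 4.0, 10.0):
    probs = softmax_with_temperature(teacher_logits, t)
    print(f"T={t}: probs={[round(p, 3) for p in probs]}, "
          f"entropy={entropy(probs):.3f}")
```

Comparing the printed rows against your answers to Questions 2 and 3 shows how the spread of probability across the incorrect classes changes with T, which is the quantity a one-hot label cannot express.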