A binary classifier is trained with cross-entropy loss. After 20 epochs, training loss has dropped from 0.9 to 0.3, but training accuracy has stayed at 85% for the last 15 epochs. Which of the following best explains this pattern?
A. The model has overfit: the loss decrease is spurious because the model memorized training labels
B. The model is becoming better calibrated (its probability estimates are growing more confident and accurate) without changing which class it predicts as most likely; loss and accuracy measure different things
C. There is a bug in the loss calculation; accuracy and loss should always move together during training
D. The learning rate is too high, causing loss to decrease while accuracy oscillates around the same value
Answer: B
This is the key insight about loss vs. metrics: they measure different things. Cross-entropy rewards confident correct predictions: a prediction of 0.99 for the correct class has lower loss than a prediction of 0.6, even though both produce the same classification decision. As the model's probability estimates sharpen (more confident where it is already correct), loss decreases without any training prediction crossing the decision threshold. Accuracy only changes when predictions flip from wrong to right (or vice versa). Loss and accuracy can diverge substantially, especially when the model is already classifying most examples correctly but could still improve its probability estimates.
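The pattern in this question can be reproduced with a handful of numbers. The sketch below is illustrative only: the labels and the `early` and `late` probability lists are made-up values, not output from a real model. It shows mean cross-entropy falling while accuracy at a 0.5 threshold stays fixed.

```python
import math

def binary_cross_entropy(probs, labels):
    """Mean binary cross-entropy, where each prob is the predicted P(class = 1)."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for p, y in zip(probs, labels)
    ) / len(labels)

def accuracy(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    return sum((p >= threshold) == bool(y) for p, y in zip(probs, labels)) / len(labels)

labels = [1, 1, 1, 0, 0, 0, 1, 0]

# Hypothetical predictions early in training: six of eight correct, but hesitant.
early = [0.60, 0.55, 0.65, 0.40, 0.45, 0.35, 0.45, 0.55]
# Hypothetical predictions later: the same six are correct and the same two
# are wrong, but the correct ones are now far more confident.
late = [0.97, 0.95, 0.98, 0.03, 0.05, 0.02, 0.45, 0.55]

loss_early = binary_cross_entropy(early, labels)
loss_late = binary_cross_entropy(late, labels)
acc_early = accuracy(early, labels)
acc_late = accuracy(late, labels)

print(f"early: loss={loss_early:.3f} accuracy={acc_early:.2f}")
print(f"late:  loss={loss_late:.3f} accuracy={acc_late:.2f}")
```

No prediction crosses 0.5 between the two snapshots, so accuracy cannot move, yet the loss drops by more than half.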
Question 2 Multiple Choice
A team is building a model to predict house prices. They consider MSE and Huber loss. Their dataset contains a few extreme outliers — houses sold at ten times the typical price due to unusual circumstances. Why might Huber loss be preferable to MSE here?
A. Huber loss ignores all errors below a threshold delta, so outliers that fall below the threshold do not affect training
B. MSE squares large errors, so the extreme outliers generate enormous gradients that dominate the weight updates and pull the model toward fitting the outliers; Huber loss caps large-error gradients (acting like MAE above delta), limiting the influence of outliers while preserving smooth MSE-like gradients near the minimum
C. Huber loss automatically removes outliers from the training batch before computing gradients
D. MSE is unbounded, so training diverges when outliers are present; Huber loss ensures convergence by capping total loss
Answer: B
MSE's squaring of errors is a double-edged sword: it creates smooth gradients near the optimum (great for convergence), but an error ten times larger than typical contributes a hundred times as much to the loss, so a single outlier can outweigh a hundred typical errors. In a dataset with price outliers, much of each weight update goes toward fitting those few extreme values. Huber loss switches from quadratic to linear behavior above a threshold delta, limiting the gradient magnitude from outliers while preserving the smooth convergence properties of MSE for typical errors. This gives a practical balance between robustness and trainability.
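To make the gradient argument concrete, here is a minimal sketch comparing how much of a batch's total gradient magnitude comes from one outlier under each loss. The residuals and `delta=1.0` are hypothetical values chosen for illustration, and the squared-error term is written as 0.5 * r**2 so that its derivative is simply r.

```python
def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2
    return delta * (a - 0.5 * delta)

def squared_error_grad(residual):
    """Derivative of 0.5 * r**2 with respect to the prediction (r = prediction - target)."""
    return residual

def huber_grad(residual, delta=1.0):
    """Derivative of Huber loss: equal to r inside [-delta, delta], clipped to +/-delta outside."""
    return max(-delta, min(delta, residual))

# Hypothetical residuals in units of a typical price error; the last one is an outlier.
residuals = [0.2, -0.5, 0.3, 9.0]

def outlier_share(grad_fn):
    """Fraction of the batch's total gradient magnitude contributed by the last residual."""
    magnitudes = [abs(grad_fn(r)) for r in residuals]
    return magnitudes[-1] / sum(magnitudes)

mse_share = outlier_share(squared_error_grad)
huber_share = outlier_share(huber_grad)

print(f"outlier share of gradient, squared error: {mse_share:.2f}")
print(f"outlier share of gradient, Huber:         {huber_share:.2f}")
```

Under squared error the outlier supplies most of the gradient signal in this batch; under Huber its contribution is capped at delta, so the typical examples keep a comparable say in the update.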
Question 3 True / False
The loss function determines what the model learns to optimize during training, while accuracy and other evaluation metrics capture what you actually care about — and these two can diverge.
T. True
F. False
Answer: True
This is the central practical insight about loss functions. Loss is what gradient descent acts on; metrics like accuracy, F1, AUC, or precision/recall are what you ultimately evaluate. A model can decrease its cross-entropy loss (by becoming more confident on examples it already gets right) without changing its accuracy (same examples classified correctly or incorrectly). Conversely, when many predictions sit close to the decision threshold, a small change in loss can flip them and cause a large jump in accuracy. Monitoring only loss or only accuracy gives an incomplete picture of training. Both matter, and understanding their relationship prevents misinterpreting training dynamics.
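The converse case, a small loss change with a large accuracy change, can be sketched the same way. The values below are deliberately extreme and hypothetical: every prediction starts just on the wrong side of a 0.5 threshold and ends just on the right side.

```python
import math

def mean_cross_entropy(probs, labels):
    """Mean binary cross-entropy, where each prob is the predicted P(class = 1)."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for p, y in zip(probs, labels)
    ) / len(labels)

def accuracy(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    return sum((p >= threshold) == bool(y) for p, y in zip(probs, labels)) / len(labels)

labels = [1] * 5 + [0] * 5

# Hypothetical: every prediction is barely on the wrong side of the threshold...
before = [0.49] * 5 + [0.51] * 5
# ...and then barely on the right side.
after = [0.51] * 5 + [0.49] * 5

loss_before = mean_cross_entropy(before, labels)
loss_after = mean_cross_entropy(after, labels)
acc_before = accuracy(before, labels)
acc_after = accuracy(after, labels)

print(f"before: loss={loss_before:.4f} accuracy={acc_before:.2f}")
print(f"after:  loss={loss_after:.4f} accuracy={acc_after:.2f}")
```

Real training runs are rarely this clean, but the same mechanism explains sudden accuracy jumps that appear with only a modest dip in the loss curve.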
Question 4 True / False
Mean squared error is a good default loss function for binary classification because it directly penalizes wrong class predictions and is simpler to implement than cross-entropy.
T. True
F. False
Answer: False
MSE is a poor choice for classification. With a sigmoid output, the MSE gradient with respect to the logit carries an extra factor of p(1 − p), which vanishes as the sigmoid saturates. When the model is confidently wrong, exactly where you most need large gradients to correct behavior, the gradient is close to zero. Cross-entropy avoids this: its gradient with respect to the output logit is simply (predicted_probability − true_label), which is large when the model is confidently wrong and naturally drives fast correction. Cross-entropy also has a principled probabilistic interpretation: minimizing it is equivalent to maximum likelihood estimation under a Bernoulli model, which is the natural likelihood for binary labels. MSE instead corresponds to maximum likelihood under Gaussian noise, an assumption that does not fit 0/1 labels.
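The saturation effect is easy to check numerically. The sketch below assumes a single example with a sigmoid output, a per-example MSE of (p − y)**2, and a hypothetical logit of −8 for a positive example (a confidently wrong prediction). The two gradient formulas follow from the chain rule.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy_grad_wrt_logit(z, y):
    """d/dz of binary cross-entropy with a sigmoid output: p - y."""
    return sigmoid(z) - y

def mse_grad_wrt_logit(z, y):
    """d/dz of (p - y)**2 with a sigmoid output: 2 * (p - y) * p * (1 - p)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

# Hypothetical confidently wrong case: the true label is 1,
# but the logit is strongly negative, so p is close to 0.
z, y = -8.0, 1

ce_grad = cross_entropy_grad_wrt_logit(z, y)
mse_grad = mse_grad_wrt_logit(z, y)

print(f"predicted probability:  {sigmoid(z):.6f}")
print(f"cross-entropy gradient: {ce_grad:.6f}")
print(f"MSE gradient:           {mse_grad:.6f}")
```

The cross-entropy gradient has magnitude close to 1, while the MSE gradient is smaller by roughly three orders of magnitude because of the p(1 − p) factor.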
Question 5 Short Answer
Explain why the choice of loss function is a design decision about model behavior, not just a technical implementation detail. Give an example where two loss functions would train models that behave differently even with identical architectures and data.
Model answer: The loss function defines what 'error' means, and therefore what patterns the model is rewarded for learning. Different loss functions penalize different error types differently, so identical architectures trained on identical data but with different losses will converge to models with different behaviors. Example: MAE and MSE for regression. MSE heavily penalizes large errors (through squaring), so an MSE-trained model will sacrifice accuracy on typical examples to avoid being very wrong on extremes; its predictions tend toward the conditional mean, which outliers pull strongly. An MAE-trained model penalizes errors in proportion to their size and tends toward the conditional median, which can be more robust when large errors are noise rather than signal. The same data, the same architecture, different learned behaviors, because 'minimize error' means different things.
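The mean-versus-median behavior shows up even in the simplest possible model, a single constant prediction. In this sketch the `targets` list is made up (four typical values and one outlier), and a brute-force search over integer candidates stands in for training.

```python
import statistics

# Hypothetical targets: four typical values and one extreme outlier.
targets = [1, 2, 3, 4, 100]

def mse(c):
    """Mean squared error of predicting the constant c for every target."""
    return sum((t - c) ** 2 for t in targets) / len(targets)

def mae(c):
    """Mean absolute error of predicting the constant c for every target."""
    return sum(abs(t - c) for t in targets) / len(targets)

# Brute-force search over integer candidates stands in for gradient descent.
candidates = range(0, 101)
best_under_mse = min(candidates, key=mse)
best_under_mae = min(candidates, key=mae)

print("best constant under MSE:", best_under_mse, "| mean:", statistics.mean(targets))
print("best constant under MAE:", best_under_mae, "| median:", statistics.median(targets))
```

The MSE-optimal constant lands on the mean, dragged far from the typical values by the single outlier, while the MAE-optimal constant lands on the median and stays among them.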
A deeper example: in medical diagnosis, false negatives (missed disease) may be far more costly than false positives. Standard cross-entropy penalizes both error types equally. A weighted loss that multiplies the penalty on positive (diseased) examples by 10 makes missed cases far more expensive, which shifts the model's decision boundary toward higher sensitivity at the cost of more false positives. This is a deliberate design choice: the loss encodes the cost structure of your problem. Understanding this is what separates a practitioner from someone who treats loss functions as black-box formalities.
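A small sketch of the weighted-loss idea, under stated assumptions: `pos_weight` is a hypothetical parameter name for the multiplier on the positive-class term, and the "model" is a single probability q predicted for a group of patients of whom 10% actually have the disease. A grid search finds the q that minimizes the expected loss for that group.

```python
import math

def expected_weighted_bce(q, positive_rate, pos_weight=1.0):
    """Expected weighted cross-entropy when predicting probability q for a group
    in which a fraction positive_rate of examples are truly positive."""
    return -(
        pos_weight * positive_rate * math.log(q)
        + (1.0 - positive_rate) * math.log(1.0 - q)
    )

def best_prediction(positive_rate, pos_weight):
    """Grid search for the predicted probability that minimizes the expected loss."""
    grid = [i / 1000 for i in range(1, 1000)]
    return min(grid, key=lambda q: expected_weighted_bce(q, positive_rate, pos_weight))

positive_rate = 0.10  # hypothetical: 10% of this patient group has the disease

q_unweighted = best_prediction(positive_rate, pos_weight=1.0)
q_weighted = best_prediction(positive_rate, pos_weight=10.0)

print(f"optimal prediction, unweighted loss: {q_unweighted:.3f}")
print(f"optimal prediction, 10x positive weight: {q_weighted:.3f}")
```

With the unweighted loss the optimal prediction matches the true 10% rate, so a 0.5 threshold would label the whole group negative. With the 10× weight the optimal prediction rises above 0.5, so the same threshold now flags the group as positive: higher sensitivity, bought with more false positives. The weighted model's outputs are no longer probabilities in the usual sense, which is part of the design trade-off.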