A logistic regression model outputs 0.73 for a patient in a cancer screening dataset. A colleague says 'the model predicted cancer.' What is missing from this interpretation?
A. Nothing is missing: 0.73 means the model predicts cancer, since it is greater than 0.5.
B. The decision threshold: 0.73 is a probability, and classification requires a separate threshold choice. The default 0.5 is not always correct.
C. The model should output 0 or 1 directly; 0.73 indicates the model is poorly calibrated.
D. The model must be compared to a baseline before any prediction can be made.
Logistic regression outputs P(y=1|x), which is a probability, not a label. Whether 0.73 maps to 'cancer' depends on the decision threshold you choose. With a threshold of 0.5, 0.73 is classified as 'positive', and it stays positive if you lower the threshold (e.g., to 0.3) in a screening context to catch more true positives at the cost of more false positives. With a stricter threshold above 0.73, the same output would be classified as 'negative'. Option A embeds the assumption that 0.5 is always the threshold, which conflates the model's output with a design choice that should be made separately.
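A minimal sketch of the separation between probability and label. The helper `classify` and the stricter threshold of 0.8 are hypothetical and used only for illustration; 0.73, 0.5, and 0.3 come from the explanation above.

```python
def classify(probability: float, threshold: float = 0.5) -> int:
    """Turn a predicted probability into a 0/1 label at a chosen threshold."""
    return 1 if probability >= threshold else 0

p = 0.73  # model output for the patient in the question

label_default = classify(p)           # default threshold of 0.5
label_screening = classify(p, 0.3)    # lower threshold, as in a screening setting
label_strict = classify(p, 0.8)       # hypothetical stricter threshold

print(label_default, label_screening, label_strict)
```

The same probability yields different labels depending on the threshold, which is why the threshold belongs to the decision rule rather than to the model.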
Question 2 Multiple Choice
A logistic regression is trained on two features x₁ and x₂. A student claims the decision boundary must be curved because the sigmoid function is nonlinear. Is the student correct?
A. Yes: the sigmoid introduces nonlinearity, so the boundary is a curve in feature space.
B. No: the decision boundary is where the linear combination w₁x₁ + w₂x₂ + b = 0, which is always a straight line (or hyperplane), regardless of the sigmoid.
C. It depends: the boundary is linear only if the two classes are perfectly separable.
D. Yes: the boundary is nonlinear unless regularization is applied.
The sigmoid is nonlinear in terms of the output probability, but with the default threshold the decision boundary is the set of points where P(y=1|x) = 0.5, which corresponds to the linear combination equaling zero. That equation defines a straight line in 2D or a hyperplane in higher dimensions. Because the sigmoid is monotonic, choosing a different threshold only shifts the boundary to a parallel line; it never bends it. This is a fundamental limitation of logistic regression: it cannot learn XOR-like patterns without feature engineering. The nonlinearity of the sigmoid shapes the probability surface, not the decision boundary itself.
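As a rough numerical check, the sketch below picks points on the line w₁x₁ + w₂x₂ + b = 0 and confirms that each one gets probability 0.5. The weights are hypothetical values chosen only for illustration, not taken from any trained model.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and bias, for illustration only.
w1, w2, b = 2.0, -1.0, 0.5

def score(x1: float, x2: float) -> float:
    """Linear combination that is fed into the sigmoid."""
    return w1 * x1 + w2 * x2 + b

def x2_on_boundary(x1: float) -> float:
    """Solve w1*x1 + w2*x2 + b = 0 for x2: the equation of a straight line."""
    return -(w1 * x1 + b) / w2

# Every point on the line gets probability 0.5.
boundary_probs = [sigmoid(score(x1, x2_on_boundary(x1)))
                  for x1 in (-3.0, 0.0, 1.5, 4.0)]

# Points on either side of the line fall on either side of 0.5.
p_positive_side = sigmoid(score(1.0, 0.0))   # score = 2.5
p_negative_side = sigmoid(score(-1.0, 0.0))  # score = -1.5

print(boundary_probs, p_positive_side, p_negative_side)
```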
Question 3 True / False
Logistic regression directly outputs a binary classification label (0 or 1) for each input.
T. True
F. False
Answer: False
Logistic regression outputs a continuous probability P(y=1|x) ∈ (0,1) via the sigmoid function. Converting this to a binary label requires applying a decision threshold. The usual default is 0.5, but the threshold is a separate design choice that trades off precision and recall. The probabilistic output is one of logistic regression's strengths: it encodes how confident the model is, not just which class it favors.
Question 4 True / False
Cross-entropy loss penalizes confident wrong predictions more severely than mean squared error, making it better suited for logistic regression training.
T. True
F. False
Answer: True
Cross-entropy loss is −[y·log(p) + (1−y)·log(1−p)]. When the model predicts p ≈ 0.99 but the true label is y = 0, the loss is −log(0.01) ≈ 4.6, and it grows without bound as p approaches 1. Squared error would give (0.99)² ≈ 0.98 and can never exceed 1, so even the most confident mistake incurs only a moderate penalty. Cross-entropy is also derived from maximum likelihood estimation of a Bernoulli distribution and produces well-behaved gradients for sigmoid outputs, whereas squared error on sigmoid outputs can cause vanishing gradients during training.
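The arithmetic above can be reproduced with a short sketch. The function names are hypothetical helpers; the formulas and the values p = 0.99, y = 0 are the ones used in the explanation.

```python
import math

def cross_entropy(y: int, p: float) -> float:
    """Binary cross-entropy for a single example: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error(y: int, p: float) -> float:
    """Squared error for a single example."""
    return (y - p) ** 2

y_true, p_pred = 0, 0.99  # confident and wrong

ce = cross_entropy(y_true, p_pred)   # about 4.6
se = squared_error(y_true, p_pred)   # about 0.98

print(f"cross-entropy: {ce:.3f}, squared error: {se:.4f}")
```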
Question 5 Short Answer
Why is mean squared error (MSE) not the standard loss function for logistic regression, even though logistic regression uses a regression-like framework?
Model answer: MSE applied to sigmoid outputs produces a non-convex loss surface with vanishing gradients when the sigmoid is saturated (near 0 or 1), making gradient descent unreliable. Cross-entropy loss is statistically principled: it is the negative log-likelihood of a Bernoulli distribution, which is exactly the distributional assumption behind binary classification. It is convex in the weights, so any minimum that gradient descent reaches is a global one, and its gradient simplifies cleanly to (predicted − true) × input, making updates interpretable and efficient.
The historical name 'logistic regression' reflects its origins in regression-like linear modeling, but the task is classification and the appropriate loss function follows from the probabilistic framework. MSE treats the output as if it were a continuous measurement, which is the wrong model. Cross-entropy treats the output as a probability, which is the right model. The mismatch between loss function and output interpretation is not just aesthetic: it leads to slower convergence, and because the MSE surface is non-convex, gradient descent can stall in a flat region or settle in a local minimum instead of reaching the global optimum.
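To make the vanishing-gradient point concrete, here is a small sketch comparing the two gradients with respect to the pre-sigmoid score z for one saturated, wrong prediction. The score z = 6 is a hypothetical value chosen for illustration. The cross-entropy gradient is p − y, while the squared-error gradient picks up the extra sigmoid-derivative factor p(1 − p).

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def ce_grad_wrt_score(z: float, y: int) -> float:
    """d/dz of cross-entropy with p = sigmoid(z): simplifies to (p - y)."""
    return sigmoid(z) - y

def mse_grad_wrt_score(z: float, y: int) -> float:
    """d/dz of (y - p)^2 with p = sigmoid(z): 2 * (p - y) * p * (1 - p)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

z, y = 6.0, 0  # hypothetical saturated score, true label 0 (confident and wrong)

ce_grad = ce_grad_wrt_score(z, y)    # close to 1: a strong corrective signal
mse_grad = mse_grad_wrt_score(z, y)  # close to 0: the signal has almost vanished

print(f"cross-entropy gradient: {ce_grad:.4f}, MSE gradient: {mse_grad:.4f}")
```

Multiplying either quantity by the input x gives the gradient with respect to the weights, which for cross-entropy is the (predicted − true) × input form mentioned in the model answer.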