A spam classifier computes P(spam | features) = 0.000003 and P(not spam | features) = 0.000001 for a particular email. The classifier marks the email as spam. Despite the probabilities being wildly inaccurate (the true spam probability is 0.97), this classification is correct. Why?
A. Laplace smoothing corrected the probability estimates before classification
B. Classification only requires the correct class to have the highest score; even poorly calibrated probabilities preserve the correct ordering
C. Working in log space normalizes the probabilities before the decision is made
D. The naive Bayes assumption ensures probability estimates are accurate enough for practical classification
Answer: B
This is the key reason naive Bayes works despite its violated independence assumption. Classification is an argmax operation: pick the class with the highest posterior probability. Even if all probabilities are orders of magnitude wrong, the *ranking* of class posteriors is often preserved. As long as the spam class gets the highest score (even if that score is 0.000003 vs 0.000001), the decision is correct. This is why naive Bayes is called a 'good classifier but bad estimator' — it gets decisions right far more often than its probability estimates would suggest.
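A minimal sketch of the argmax argument, treating the two values from the spam example as raw class scores. The variable names are illustrative assumptions, not part of any particular library.

```python
# Raw class scores from the spam example (assumed to be prior x likelihood products).
scores = {"spam": 0.000003, "not spam": 0.000001}

# Classification is an argmax over the scores; no normalization is needed.
predicted = max(scores, key=scores.get)

# Dividing by the total rescales every score by the same positive constant,
# so the ordering of the classes (and therefore the argmax) cannot change.
total = sum(scores.values())
normalized = {label: s / total for label, s in scores.items()}

print(predicted)    # spam
print(normalized)   # spam is still the larger of the two values
```

Any monotone rescaling of the scores leaves the decision untouched, which is why the absolute magnitudes matter so little for classification.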
Question 2 Multiple Choice
A text classifier uses naive Bayes with a vocabulary of 50,000 words and 10 class labels. How many likelihood parameters must be estimated for the class-conditional distributions P(feature | class)?
A. 50,000: one probability per word regardless of class
B. 500,000: one probability per word-class combination
C. 50,000^10: the full joint distribution across all words for each class
D. 10: one class-conditional distribution treated as a single parameter
Answer: B
With the naive Bayes independence assumption, P(X₁, X₂, ..., Xₙ | C) = P(X₁|C) · P(X₂|C) · ... · P(Xₙ|C). So we need one P(word_i | class_j) for each of the 50,000 words × 10 classes = 500,000 parameters. Without the independence assumption, estimating the full joint distribution P(X₁, X₂, ..., X₅₀,₀₀₀ | C) would require an astronomically large number of parameters (on the order of 2^50,000 per class even if each word is only recorded as present or absent), which is effectively impossible with any realistic training set. The independence assumption reduces an intractable estimation problem to a tractable one.
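The arithmetic can be checked with a short sketch. The full-joint count assumes binary word-presence features, which is a simplifying assumption made here for illustration. It is compared on a log scale because the number itself has thousands of digits.

```python
import math

vocab_size = 50_000
num_classes = 10

# Naive Bayes: one P(word_i | class_j) per word-class pair.
naive_bayes_params = vocab_size * num_classes

# Full joint over binary present/absent features (an assumption for this sketch):
# 2**vocab_size - 1 free parameters per class. Work with log10 to avoid
# materializing an integer with thousands of digits.
log10_full_joint = math.log10(num_classes) + vocab_size * math.log10(2)

print(naive_bayes_params)        # 500000
print(round(log10_full_joint))   # order of magnitude (power of ten) of the full-joint count
```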
Question 3 True / False
Naive Bayes requires that its conditional independence assumption holds approximately in the data for it to achieve good classification accuracy.
T. True
F. False
Answer: False
This is the central misconception about naive Bayes. The independence assumption is routinely and often dramatically violated in practice. In spam classification, words like 'free,' 'click,' and 'offer' co-occur far more than independence would predict. Yet naive Bayes still achieves competitive classification accuracy. The reason: classification is an argmax, not a probability estimate. As long as the independence violations don't flip which class receives the highest score — and empirically they often don't — the classifier gets the decision right. Accuracy and calibration are distinct: naive Bayes is poorly calibrated but often correctly ranked.
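One way to see the gap between ranking and calibration is to count the same evidence several times, which is what naive Bayes effectively does with strongly correlated words. The priors and likelihoods below are hypothetical numbers chosen for illustration.

```python
def spam_posterior(copies):
    """Naive Bayes posterior for spam when one word's evidence is counted `copies` times."""
    prior = {"spam": 0.5, "ham": 0.5}        # hypothetical class priors
    likelihood = {"spam": 0.6, "ham": 0.2}   # hypothetical P(word | class)
    # Treating perfectly correlated copies as independent multiplies the
    # same likelihood in once per copy.
    score = {c: prior[c] * likelihood[c] ** copies for c in prior}
    return score["spam"] / (score["spam"] + score["ham"])

honest = spam_posterior(1)     # evidence counted once: about 0.75
inflated = spam_posterior(5)   # same evidence counted five times: about 0.996

print(honest, inflated)
```

Both posteriors are above 0.5, so the decision is spam either way; only the stated confidence is distorted by the double counting.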
Question 4 True / False
Without Laplace smoothing, a single word that appears in training data for class A but never for class B will cause naive Bayes to assign zero probability to class B for any document containing that word, regardless of all other evidence.
T. True
F. False
Answer: True
This is the zero-frequency (or zero-count) problem. Without smoothing, P(word | class B) = 0/n = 0. Since naive Bayes multiplies likelihoods together, one zero factor zeros out the entire product: P(word₁|B) · 0 · P(word₃|B) · ... = 0. No matter how strongly all other words favor class B, this single unseen word makes P(B|document) = 0. Laplace smoothing (adding a small count, typically 1, to every feature-class combination) ensures no probability is exactly zero, so evidence from all features can contribute to the final decision.
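A small sketch of the zero-count problem and the add-one fix. The word counts, class total, and vocabulary size are made-up values, and `word_likelihood` is a hypothetical helper rather than a library function.

```python
from math import prod

def word_likelihood(count, total, vocab_size, alpha=0.0):
    """P(word | class) with add-alpha smoothing; alpha=0 means no smoothing."""
    return (count + alpha) / (total + alpha * vocab_size)

# Hypothetical training counts for class B: "lottery" was never seen.
counts_b = {"meeting": 50, "agenda": 30, "lottery": 0}
total_b = 1000       # total word tokens observed for class B (assumed)
vocab_size = 5000    # vocabulary size (assumed)
document = ["meeting", "agenda", "lottery"]

unsmoothed = prod(word_likelihood(counts_b[w], total_b, vocab_size) for w in document)
smoothed = prod(word_likelihood(counts_b[w], total_b, vocab_size, alpha=1.0) for w in document)

print(unsmoothed)  # 0.0: the single unseen word wipes out the other evidence
print(smoothed)    # small but nonzero
```

With `alpha=1.0` the unseen word contributes a small factor of 1 / (total + vocabulary size) instead of zero, so the other words still influence the comparison between classes.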
Question 5 Short Answer
Explain why naive Bayes is described as a 'good classifier but bad estimator.' What does this mean, and why does the independence assumption's violation not necessarily impair classification performance?
Model answer: Naive Bayes is a 'good classifier' because it frequently assigns the highest posterior probability to the correct class, leading to correct decisions. It is a 'bad estimator' because the actual probability values it produces are often wildly miscalibrated: the true probability might be 0.97 while naive Bayes estimates 0.000003. Violating the independence assumption doesn't always impair classification because argmax only requires that the correct class ranks first, not that the probabilities are accurate. Even when feature co-occurrences violate independence (causing probability estimates to be wrong), the relative ordering of class posteriors is often still correct. Where the assumption fails badly enough to flip rankings, accuracy does degrade, but in many practical domains, particularly text, the ranking is robust to the violated assumption.
This distinction between classification accuracy and probability calibration is fundamental in machine learning. When you need well-calibrated probabilities (e.g., for risk scoring or cost-sensitive decisions), raw naive Bayes outputs are insufficient, and a calibration method such as isotonic regression or Platt scaling is typically used to post-process them. But for pure classification tasks, naive Bayes remains competitive and is often the right choice due to its speed and data efficiency.
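As a rough illustration of the post-processing idea, the sketch below fits a Platt-style sigmoid, p = sigmoid(a · score + b), to classifier scores with plain gradient descent. The scores and labels are invented, `fit_platt` is a hypothetical helper, and Platt's original method uses a more careful optimizer and smoothed targets, so treat this as a simplified stand-in. In practice the fit would be done on held-out data rather than the training set.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.05, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y   # gradient of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical naive Bayes log-odds scores and true labels (1 = spam).
scores = [-12.0, -8.0, -5.0, -2.0, -1.0, 1.0, 3.0, 6.0, 9.0, 14.0]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
calibrated = [sigmoid(a * s + b) for s in scores]
print(a, b)
print(calibrated)
```

Because the fitted slope is positive, the mapping is monotone: it changes the probability values without changing the ranking of examples, so the classifier's ordering of documents by score is preserved.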