A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Naive Bayes Classifier

Graduate · Depth 95 in the knowledge graph

540 prerequisites beneath it

Core Idea

The naive Bayes classifier applies Bayes' theorem under a strong assumption: all features are conditionally independent given the class label. Although this assumption rarely holds exactly, naive Bayes is often effective for text classification tasks such as spam detection, and in other domains where dependence between features is weak. It is fast to train and needs comparatively little training data, because it estimates only one small distribution per feature per class.

How It's Best Learned

Implement naive Bayes for text classification and examine learned probabilities to understand which features are most predictive of each class.
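
One way to examine the learned probabilities is to rank words by the log ratio of their smoothed per-class likelihoods. The following is a minimal sketch under stated assumptions: the word counts are invented toy data, the function name `top_words` is hypothetical, and add-alpha smoothing (see the Explainer) is used so that unseen words do not produce a division by zero.

```python
import math


def top_words(counts_a, counts_b, alpha=1.0, k=3):
    """Return the k words that most favor class A over class B.

    counts_a and counts_b map word -> count within each class (toy data).
    Words are ranked by log(P(word | A) / P(word | B)) after smoothing.
    """
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)

    def log_ratio(word):
        p_a = (counts_a.get(word, 0) + alpha) / total_a
        p_b = (counts_b.get(word, 0) + alpha) / total_b
        return math.log(p_a / p_b)

    return sorted(vocab, key=log_ratio, reverse=True)[:k]


# Invented counts, for illustration only.
spam_counts = {"free": 20, "click": 15, "meeting": 1}
ham_counts = {"meeting": 18, "lunch": 10, "free": 2}

print(top_words(spam_counts, ham_counts, k=2))
```

Swapping the two arguments ranks the words that most favor the other class.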

Explainer

You already know Bayes' theorem: P(C|X) = P(X|C) · P(C) / P(X), where C is a class label and X is observed evidence. A Bayesian classifier uses this directly: compute the posterior probability of each class given the features and pick the most probable one. Because P(X) is the same for every class, it can be dropped when comparing classes. The challenge is estimating P(X|C), the likelihood of seeing a particular combination of features given the class. If X consists of hundreds of features, the number of parameters in the joint distribution P(X₁, X₂, ..., Xₙ|C) grows exponentially with n; with binary features, a full table needs 2ⁿ − 1 free parameters per class. With realistic training set sizes, you will never observe most feature combinations, so direct estimation is impractical.
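
To make the size of that joint distribution concrete, here is a small sketch that counts free parameters, assuming every feature is binary. The function names are illustrative and not taken from any library; the second function counts parameters for the factorized model that the next paragraph introduces.

```python
def joint_param_count(n_features: int, n_classes: int) -> int:
    """Free parameters for a full joint table over binary features.

    Each class needs a probability for each of the 2**n feature
    combinations, minus one because the probabilities sum to 1.
    """
    return n_classes * (2 ** n_features - 1)


def naive_param_count(n_features: int, n_classes: int) -> int:
    """Free parameters when features are conditionally independent.

    Each class needs one parameter P(X_i = 1 | C) per binary feature.
    """
    return n_classes * n_features


for n in (3, 10, 30):
    print(n, joint_param_count(n, 2), naive_param_count(n, 2))
```

With 30 binary features and two classes, the full table already needs more than two billion parameters, while the factorized model needs 60.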

The naive Bayes assumption cuts through this problem with a single bold simplification: all features are conditionally independent given the class label. This means P(X₁, X₂, ..., Xₙ|C) = P(X₁|C) · P(X₂|C) · ... · P(Xₙ|C). Instead of estimating one enormous joint distribution, you estimate n small univariate distributions — each requiring only enough data to count how often each feature value appears within each class. For text classification, this means counting word frequencies per class, which is trivially fast even for vocabularies of hundreds of thousands of words. Training reduces to counting, which is why naive Bayes is one of the fastest classifiers to fit.
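
Training by counting can be sketched in a few lines. The toy documents, the labels, and the function name `fit_counts` are made up for illustration; a real pipeline would also tokenize and normalize the text first.

```python
from collections import Counter, defaultdict


def fit_counts(docs, labels):
    """Count documents per class and word occurrences per class.

    docs is a list of token lists; labels is a parallel list of class names.
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    return class_counts, word_counts


# Invented toy corpus, for illustration only.
docs = [
    ["free", "click", "now"],
    ["free", "offer"],
    ["meeting", "at", "noon"],
    ["lunch", "meeting"],
]
labels = ["spam", "spam", "ham", "ham"]

class_counts, word_counts = fit_counts(docs, labels)
print(class_counts["spam"], word_counts["spam"]["free"])
```

A single pass over the data is enough, and the raw count for "free" in the ham class is zero, which is the situation smoothing is meant to handle.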

The independence assumption is almost always wrong in practice. In a spam classifier, the words "free" and "click" are not independent given that the email is spam — they co-occur far more often than chance would predict. Yet naive Bayes still works remarkably well. The reason is that classification only requires getting the *ranking* of class probabilities right, not their exact values. Even when the estimated probabilities are poorly calibrated (and they typically are), the correct class often still receives the highest score. The classifier does not need the joint distribution to be accurate — it only needs the product of marginals to preserve the ordering of classes. This is why naive Bayes is called a good classifier but a bad estimator.
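
A tiny numeric sketch of the ranking argument, under deliberately artificial assumptions: one informative feature is copied three times, so the copies are perfectly dependent, yet the naive model multiplies the likelihoods as if they were independent. The probabilities and the function name `posterior` are invented for illustration.

```python
def posterior(prior_spam, lik_spam, lik_ham, copies=1):
    """P(spam | evidence) when one feature's likelihood is multiplied
    in `copies` times, as naive Bayes would do for duplicated features."""
    spam_score = prior_spam * lik_spam ** copies
    ham_score = (1 - prior_spam) * lik_ham ** copies
    return spam_score / (spam_score + ham_score)


# The feature appears in 80% of spam and 20% of ham; classes are equally likely.
p_single = posterior(0.5, 0.8, 0.2, copies=1)  # evidence counted once
p_naive = posterior(0.5, 0.8, 0.2, copies=3)   # same evidence counted three times

print(p_single, p_naive)
```

Counting the same evidence three times pushes the estimate from about 0.8 to about 0.98, so the probability is overconfident, but spam remains the top-ranked class. This shows only one favorable case; strong dependencies can also change the ranking.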

In practice, you need to handle two technical issues. First, smoothing: if a feature value never appears with a particular class in the training data, its estimated likelihood is zero, which zeroes out the entire product regardless of all other evidence. Laplace smoothing, which adds one to every feature-class count (or, more generally, a small constant α), prevents this. Second, working in log space: multiplying many small probabilities together causes numerical underflow, so implementations sum log-probabilities instead. The classification decision becomes an argmax over the log-prior plus the sum of log-likelihoods, which is simple, fast, and numerically stable. Different variants of naive Bayes handle different feature types: multinomial naive Bayes models word counts, Bernoulli naive Bayes models binary word presence, and Gaussian naive Bayes models continuous features by fitting a normal distribution per feature per class.
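
Both issues can be seen in a minimal multinomial-style sketch. It assumes pre-tokenized toy documents, additive smoothing with a parameter named `alpha` here, and that words outside the training vocabulary are simply skipped at prediction time; the function names `fit` and `predict` are illustrative and not a library API.

```python
import math
from collections import Counter, defaultdict


def fit(docs, labels, alpha=1.0):
    """Estimate log-priors and smoothed log-likelihoods by counting."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)

    log_prior = {c: math.log(k / len(labels)) for c, k in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        # Adding alpha to every count keeps each probability above zero.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_lik


def predict(tokens, log_prior, log_lik):
    """Return the argmax class and all scores (log-prior + sum of log-likelihoods)."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get), scores


# Invented toy corpus, for illustration only.
docs = [
    ["free", "click", "now"],
    ["free", "offer"],
    ["meeting", "at", "noon"],
    ["lunch", "meeting"],
]
labels = ["spam", "spam", "ham", "ham"]

log_prior, log_lik = fit(docs, labels)
label, scores = predict(["free", "click"], log_prior, log_lik)
print(label, scores)

# Underflow: a product of 200 probabilities of 0.01 collapses to 0.0,
# while the equivalent sum of logs stays an ordinary finite number.
product = 1.0
for _ in range(200):
    product *= 0.01
log_sum = 200 * math.log(0.01)
print(product, log_sum)
```

"click" never occurs in the ham documents, yet the ham score stays finite because of the smoothing term, and the comparison between classes is carried out entirely on sums of logs.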

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Weak Law of Large Numbers → Probability Axioms and Rules → Conditional Probability → Law of Total Probability → Bayes' Theorem → Naive Bayes Classifier

Longest path: 96 steps · 540 total prerequisite topics

Prerequisites (4)

Supervised Learning Fundamentals (hard) · Bayes' Theorem (soft) · Conditional Probability (soft) · Probability Axioms (soft)

Leads To (0)

No topics depend on this one yet.