A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Imbalanced Classification and Class Weighting

Graduate · Depth 94 in the knowledge graph

648 prerequisites beneath it


Core Idea

In imbalanced datasets, one class vastly outnumbers the others, which biases models toward the majority class and degrades performance on the minority classes. Remedies include class weighting (penalizing errors on minority examples more heavily), oversampling the minority class, undersampling the majority class, and adjusting the decision threshold. The right choice depends on the cost of each type of error and on data constraints.

Explainer

Imagine training a fraud detection model where only 1 in 1,000 transactions is fraudulent. A classifier that simply predicts "not fraud" for every transaction achieves 99.9% accuracy and catches zero actual fraud. This is the fundamental problem of imbalanced classification: when one class vastly outnumbers another, standard supervised learning algorithms minimize an average loss in which every example counts equally, so the majority class dominates training and the minority class is effectively ignored. The model learns that predicting the majority label almost everywhere keeps its loss low, a solution that looks good on paper but is practically useless.
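The arithmetic behind the fraud example can be checked in a few lines. This is a minimal sketch on made-up labels (1,000 transactions, exactly one fraudulent, with 1 meaning fraud); the variable names are illustrative only.

```python
# Hypothetical labels: 1,000 transactions, exactly 1 fraudulent (label 1).
y_true = [1] + [0] * 999

# A "classifier" that predicts "not fraud" (label 0) for every transaction.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the fraud class: fraction of actual fraud cases that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.999
print(f"recall   = {recall:.3f}")    # recall   = 0.000
```

High accuracy and zero recall on the same predictions is the reason accuracy alone cannot be trusted on imbalanced data.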

The most direct fix is class weighting, which adjusts the loss function so that misclassifying a minority example costs more than misclassifying a majority example. If you recall from supervised learning how the model minimizes a loss function during training, class weighting simply multiplies the loss contribution of each minority sample by a factor proportional to the imbalance ratio (equivalently, inversely proportional to the class's frequency). A dataset with 100:1 imbalance might weight minority errors 100 times more heavily, so the optimizer treats one missed fraud case as seriously as misclassifying 100 legitimate transactions. Many classifier implementations, including the logistic regression classifier you may already know, accept a class_weight parameter (the name used in scikit-learn, for example) that does exactly this.
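To make the weighting concrete, here is a minimal pure-Python sketch of a class-weighted binary cross-entropy. The helper names `balanced_class_weights` and `weighted_log_loss` are hypothetical, the data is made up, and the weight formula `n_samples / (n_classes * class_count)` is one common heuristic, not the only reasonable choice. Library implementations may also normalize the weighted loss differently.

```python
import math
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


def weighted_log_loss(y_true, p_pred, class_weight):
    """Mean binary cross-entropy with a per-class multiplier on each sample.

    p_pred holds the predicted probability of class 1 for each sample.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, 1e-12), 1 - 1e-12)  # clip to avoid log(0)
        sample_loss = -math.log(p) if y == 1 else -math.log(1 - p)
        total += class_weight[y] * sample_loss
    return total / len(y_true)


# Hypothetical 100:1 dataset: one positive, one hundred negatives.
y = [1] + [0] * 100
weights = balanced_class_weights(y)

# A model that assigns probability 0.01 of "positive" to every sample.
p_always_negative = [0.01] * len(y)

unweighted = weighted_log_loss(y, p_always_negative, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y, p_always_negative, weights)

print(f"minority/majority weight ratio: {weights[1] / weights[0]:.1f}")
print(f"unweighted loss: {unweighted:.4f}")
print(f"weighted loss:   {weighted:.4f}")
```

Under uniform weights the near-constant "negative" prediction looks cheap; with the class weights applied, the single missed positive dominates the loss, which is the pressure that pushes the optimizer away from the trivial solution.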

Another family of solutions operates on the data itself rather than the loss function. Oversampling creates additional copies of minority examples (or synthesizes new ones using techniques like SMOTE, which interpolates between existing minority points in feature space). Undersampling discards majority examples to bring the class ratio closer to balance. Oversampling risks overfitting to the specific minority examples you have; undersampling throws away potentially useful majority data. Hybrid approaches combine both, and the best choice depends on dataset size — undersampling works well when you have abundant data, while oversampling helps when data is scarce.
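The two simplest resampling strategies can be sketched with the standard library alone. This shows plain random oversampling (duplicating minority rows) and random undersampling (dropping majority rows), not SMOTE. The function names and toy data are hypothetical, and the sketch assumes a binary problem in which the minority class really is the smaller one.

```python
import random


def _split_indices(y, minority_label):
    """Return (minority_indices, majority_indices) for a binary label list."""
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    return minority, majority


def random_oversample(X, y, minority_label, rng):
    """Duplicate randomly chosen minority rows until the classes are equal."""
    minority, majority = _split_indices(y, minority_label)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    idx = majority + minority + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def random_undersample(X, y, minority_label, rng):
    """Keep all minority rows and an equally sized random majority subset."""
    minority, majority = _split_indices(y, minority_label)
    kept = rng.sample(majority, k=len(minority))
    idx = minority + kept
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


# Hypothetical toy data: 10 minority rows (label 1), 100 majority rows (label 0).
X = [[float(i)] for i in range(110)]
y = [1] * 10 + [0] * 100

rng = random.Random(0)  # fixed seed so the sketch is reproducible
X_over, y_over = random_oversample(X, y, minority_label=1, rng=rng)
X_under, y_under = random_undersample(X, y, minority_label=1, rng=rng)

print(len(y_over), y_over.count(1), y_over.count(0))     # 200 100 100
print(len(y_under), y_under.count(1), y_under.count(0))  # 20 10 10
```

Resampling is normally applied to the training split only. Evaluating on resampled data would hide the true class ratio the model will face.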

Finally, threshold adjustment changes how the model's output probabilities translate into class predictions. By default, a classifier predicts the positive class when its estimated probability exceeds 0.5, but on imbalanced data, lowering this threshold to 0.1 or 0.05 lets the model catch more minority cases at the cost of more false positives. The right threshold depends on the relative cost of errors: missing a cancer diagnosis is far more expensive than ordering an unnecessary follow-up test. Precision-recall curves and the F1 score become essential evaluation tools here, because accuracy is misleading when classes are imbalanced.
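The precision/recall trade-off described above can be seen by sweeping the threshold over a handful of scores. The labels and probabilities below are invented for illustration, and `precision_recall_f1` is a hypothetical helper rather than a library function.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Predict positive when score > threshold, then compute the metrics."""
    y_pred = [1 if s > threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical model outputs: 3 positives among 10 samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.70, 0.30, 0.08, 0.40, 0.12, 0.06, 0.04, 0.03, 0.02, 0.01]

results = {}
for threshold in (0.5, 0.1, 0.05):
    results[threshold] = precision_recall_f1(y_true, scores, threshold)
    p, r, f1 = results[threshold]
    print(f"threshold={threshold:<4}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
# threshold=0.5   precision=1.00  recall=0.33  f1=0.50
# threshold=0.1   precision=0.50  recall=0.67  f1=0.57
# threshold=0.05  precision=0.50  recall=1.00  f1=0.67
```

On this toy data, lowering the threshold raises recall from one in three positives to all three, while precision drops from 1.00 to 0.50. Which point is preferable depends on the relative cost of false negatives and false positives.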


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Weak Law of Large Numbers → Probability Axioms and Rules → Conditional Probability → Logistic Regression for Classification → Imbalanced Classification and Class Weighting

Longest path: 95 steps · 648 total prerequisite topics

Prerequisites (2)

Supervised Learning Fundamentals (hard) · Logistic Regression for Classification (soft)

Leads To (0)

No topics depend on this one yet.