A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Multicollinearity

College · Depth 114 in the knowledge graph

72 topics build on this

583 prerequisites beneath it


Core Idea

Multicollinearity arises when two or more regressors are highly (but not perfectly) correlated, making it difficult for OLS to separately identify their individual effects. It inflates standard errors, widens confidence intervals, and leaves individual t-tests with little power, but it does not bias the coefficient estimates. Variance Inflation Factors (VIFs) quantify how much the variance of each coefficient estimate is inflated relative to the case where that regressor is uncorrelated with the others. Perfect multicollinearity (e.g., including a regressor that is an exact linear combination of other regressors) makes (X'X) singular, so the OLS estimator is not uniquely defined.
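The perfect-collinearity case can be checked numerically. The following is a minimal sketch on simulated data; the variable names and the coefficients 2 and -3 are illustrative assumptions, not part of the original text. It builds a regressor that is an exact linear combination of two others and shows that the design matrix loses a rank, which is what makes X'X singular.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Two unrelated regressors (simulated, illustrative only).
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# A third regressor that is an exact linear combination of the first two.
x3 = 2.0 * x1 - 3.0 * x2

# Design matrix with an intercept column: 4 columns in total.
X = np.column_stack([np.ones(n), x1, x2, x3])

# Only 3 of the 4 columns are linearly independent, so X'X is singular.
rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)
```

With high but imperfect correlation the rank stays full and OLS is defined; only the precision of the estimates suffers.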

Common Misconceptions

Imperfect multicollinearity is a data problem, not a model misspecification: it does not violate any Gauss-Markov assumption (only perfect collinearity does).
Dropping a correlated variable 'fixes' multicollinearity but may introduce omitted variable bias.

Explainer

Suppose you are regressing a worker's wage on both years of education and a cognitive test score. These two variables are positively correlated — people with more education tend to score higher. Now imagine you ask OLS to tell you: "How much does an extra year of education raise wages, holding the test score fixed?" The model must find observations where education increases but test scores do not change — comparisons that may be rare in the data because the two variables tend to move together. When the variables are highly correlated, OLS struggles to separately attribute wage variation to education versus the test score. This is the essence of multicollinearity: not a model error, but a data problem — the information needed to cleanly identify separate effects is thin.

The consequence shows up in the standard errors, not in the estimates themselves. OLS coefficient estimates remain unbiased and consistent even under severe multicollinearity: the Gauss-Markov conditions are not violated, so OLS is still BLUE. But the estimates become imprecise. Intuitively, when the model cannot distinguish education's effect from the test score's effect, it produces wide confidence intervals around both. You'll see large standard errors and high p-values, so individual t-tests fail to reject H₀ even though R² and the overall F-statistic may remain high. This pattern is a diagnostic fingerprint: individually insignificant coefficients paired with a significant overall F-test often indicate multicollinearity.
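A small Monte Carlo experiment can illustrate the "unbiased but imprecise" claim. This is a sketch under stated assumptions: the true coefficients (1, 2, 3), the sample size, the number of replications, and the correlation levels 0 and 0.95 are all hypothetical choices made for illustration. It repeatedly draws samples, fits OLS, and compares the spread of the estimated coefficient on the first regressor across the two correlation levels.

```python
import numpy as np

def simulate_slopes(rho, n=200, reps=2000, seed=0):
    """Return OLS estimates of the x1 coefficient across repeated samples."""
    rng = np.random.default_rng(seed)
    beta = np.array([1.0, 2.0, 3.0])          # intercept, x1, x2 (assumed true values)
    cov = np.array([[1.0, rho], [rho, 1.0]])  # correlation rho between x1 and x2
    estimates = np.empty(reps)
    for r in range(reps):
        x = rng.multivariate_normal(np.zeros(2), cov, size=n)
        X = np.column_stack([np.ones(n), x])
        y = X @ beta + rng.normal(size=n)     # unit-variance noise
        b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates[r] = b_hat[1]
    return estimates

low = simulate_slopes(rho=0.0)
high = simulate_slopes(rho=0.95)

# Both averages should sit near the true value 2.0; the spread should differ.
print("rho=0.00  mean:", low.mean(), " sd:", low.std())
print("rho=0.95  mean:", high.mean(), " sd:", high.std())
```

In runs like this, both averages land close to the true value while the standard deviation under high correlation is several times larger, which matches the pattern described above.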

The Variance Inflation Factor (VIF) quantifies this precisely. For each regressor, VIF measures how much its variance (squared standard error) is inflated relative to what it would be if that regressor were uncorrelated with all others. A VIF of 1 means no inflation; a VIF of 10 means the standard error is √10 ≈ 3.16 times larger than it would be in an ideal orthogonal design. The formula is VIF_j = 1 / (1 - R²_j), where R²_j is the R-squared from regressing variable j on all other regressors. High R²_j means variable j is nearly a linear combination of the others — exactly the problem.
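The formula can be computed directly with one auxiliary regression per regressor. The sketch below is one possible implementation, not a reference one: the helper name `vif`, the simulated variables `educ`, `score`, and `tenure`, and the 0.9 correlation are assumptions introduced for illustration.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R2_j) for each column of X (regressors only, no intercept)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(1)
n = 5000
educ = rng.normal(size=n)
# Test score built to have correlation of roughly 0.9 with education.
score = 0.9 * educ + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
tenure = rng.normal(size=n)  # unrelated to the other two

X = np.column_stack([educ, score, tenure])
vifs = vif(X)
print(dict(zip(["educ", "score", "tenure"], vifs.round(2))))
```

With a pairwise correlation near 0.9, the formula implies a VIF of about 1 / (1 - 0.81) ≈ 5.3 for the two correlated regressors, while the unrelated regressor should stay close to 1.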

The response to multicollinearity requires care. The naive fix — dropping one of the correlated variables — does reduce standard errors, but at the cost of omitted variable bias if the dropped variable actually belongs in the model. The cleaner solutions are: collect more data (larger samples improve precision even when correlation persists), use ridge regression or other shrinkage methods that trade some bias for variance reduction, or reconsider whether the model is asking for a finer distinction than the data can support. Sometimes multicollinearity is telling you that two theoretical constructs are operationally inseparable in your dataset — a substantive finding, not just a statistical nuisance.
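The bias-for-variance trade made by ridge regression can be seen in a short simulation. This is a hedged sketch: the closed-form `ridge` helper, the penalty `lam=10.0`, the true slopes (2, 3), and the 0.95 correlation are illustrative assumptions, and in practice the penalty would be chosen by cross-validation rather than fixed by hand.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge slopes on centered data; the intercept is not penalized."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    k = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(k), Xc.T @ yc)

rng = np.random.default_rng(2)
n, reps, rho = 50, 1000, 0.95
beta = np.array([2.0, 3.0])                 # assumed true slopes
cov = np.array([[1.0, rho], [rho, 1.0]])

ols_est = np.empty((reps, 2))
ridge_est = np.empty((reps, 2))
for r in range(reps):
    X = rng.multivariate_normal(np.zeros(2), cov, size=n)
    y = 1.0 + X @ beta + rng.normal(size=n)
    ols_est[r] = ridge(X, y, lam=0.0)       # lam = 0 reproduces OLS slopes
    ridge_est[r] = ridge(X, y, lam=10.0)    # arbitrary illustrative penalty

print("OLS   mean:", ols_est.mean(axis=0), " sd:", ols_est.std(axis=0))
print("ridge mean:", ridge_est.mean(axis=0), " sd:", ridge_est.std(axis=0))
```

The OLS averages should sit near the true slopes with a wide spread, while the ridge estimates are pulled away from the true values but vary much less from sample to sample.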


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Weak Law of Large Numbers → Probability Axioms and Rules → Conditional Probability → Independence of Events → Sampling Distributions → Standard Error of Estimators → Hypothesis Testing: Framework and Logic → P-values and Statistical Significance → Effect Size and Practical Significance → Hypothesis Testing: Framework and Logic → Z-Tests and T-Tests for Means → One-Sample Z-Test for Means → One-Sample and Two-Sample T-Tests → Inference in Linear Regression → Prediction Intervals in Regression → Linear Regression Basics → Residuals and Goodness of Fit (R²) → Simple (Bivariate) OLS Regression → Classical OLS Assumptions (Gauss-Markov) → Multiple Regression → Interpreting Regression Coefficients → Hypothesis Testing in Regression → F-Test and Joint Significance → R-Squared and Model Fit → Multicollinearity

Longest path: 115 steps · 583 total prerequisite topics

Prerequisites (4)

Multiple Regression (hard)
Correlation Coefficient (hard)
Matrices Introduction (soft)
R-Squared and Model Fit (soft)

Leads To (4)

Multicollinearity: Detection Using VIF (hard)
Ridge, Lasso, and Elastic Net Regression (hard)
Robust Standard Errors (soft)
Variance Inflation Factor and Multicollinearity Diagnosis (hard)