Information Criteria: AIC and BIC for Model Selection

College · Depth 82 in the knowledge graph
Unlocks 2 downstream topics
model-selection information-criteria aic bic

Core Idea

AIC and BIC are criteria that balance fit and parsimony when choosing among competing models. Both penalize the number of parameters, with BIC imposing a stronger penalty that favors simpler models. Lower values indicate better models.

How It's Best Learned

Compare models of different complexity, fit to the same data, using AIC or BIC. Understand that AIC is asymptotically efficient for prediction (it tends to pick the candidate with the lowest out-of-sample prediction error), while BIC is consistent for model selection when the true model is in the candidate set.

Common Misconceptions

AIC and BIC are not goodness-of-fit measures; lower values don't mean the model fits well, only that it's better relative to alternatives in the comparison set. The absolute values cannot be compared across different samples or response transformations.

Explainer

From your study of R² and adjusted R², you already know the central tension in model selection: adding regressors never worsens in-sample fit, but not all of those regressors improve genuine explanatory power. Adjusted R² penalizes extra parameters, but it is defined only for linear models estimated by OLS. Information criteria, principally AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion), generalize this principle to any model estimated by maximum likelihood, which makes them applicable to logit, probit, count models, survival models, and other likelihood-based estimators.

Both criteria share the same structure: they reward fit and penalize complexity. Specifically, AIC = −2 ln(L̂) + 2k and BIC = −2 ln(L̂) + k ln(n), where L̂ is the maximized likelihood, k is the number of estimated parameters, and n is the sample size. A larger log-likelihood means better fit, and because it enters with a factor of −2, it pushes AIC and BIC down. More parameters push AIC and BIC up. You want the model with the lowest AIC or BIC: lower means a better balance of fit and parsimony.
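The two formulas above can be written out directly. The following is a minimal sketch, not a library API: the function names `aic` and `bic` and the numbers (a log-likelihood of −320.5, 3 parameters, 200 observations) are hypothetical and chosen only for illustration.

```python
import math


def aic(log_lik: float, k: int) -> float:
    """AIC = -2 ln(L_hat) + 2k."""
    return -2.0 * log_lik + 2.0 * k


def bic(log_lik: float, k: int, n: int) -> float:
    """BIC = -2 ln(L_hat) + k ln(n)."""
    return -2.0 * log_lik + k * math.log(n)


# Hypothetical fitted model: maximized log-likelihood, parameter count, sample size.
log_lik, k, n = -320.5, 3, 200

print(f"AIC = {aic(log_lik, k):.2f}")
print(f"BIC = {bic(log_lik, k, n):.2f}")
```

Both functions take the maximized log-likelihood ln(L̂) rather than L̂ itself, since that is what estimation software typically reports.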

The key difference is the size of the penalty. BIC penalizes each parameter by ln(n) rather than 2. Since ln(n) > 2 whenever n > e² ≈ 7.39, BIC penalizes additional parameters more harshly than AIC does for any sample of 8 or more observations. In practice, BIC tends to select simpler models. Theoretically, AIC is motivated by minimizing predictive error (it targets the approximation that best predicts new data), while BIC is motivated by identifying the true model from the candidate set (it is consistent: as n → ∞, the probability that BIC selects the true model approaches 1, provided the true model is among the candidates). Neither goal is universally correct. The right criterion depends on whether you are building a predictive tool or testing a theoretical structure.
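To see how the two penalties can lead to different choices, here is a small sketch under stated assumptions. The two candidate models ("small" and "large"), their log-likelihoods, and the sample size are hypothetical values chosen so that the criteria disagree; the helper `criteria` is likewise illustrative.

```python
import math


def criteria(log_lik: float, k: int, n: int) -> tuple[float, float]:
    """Return (AIC, BIC) for a fitted model."""
    return (-2.0 * log_lik + 2.0 * k, -2.0 * log_lik + k * math.log(n))


# Per-parameter penalties: AIC charges 2, BIC charges ln(n).
# ln(n) exceeds 2 once n > e^2 (about 7.39), so from n = 8 onward.
smallest_n = math.floor(math.exp(2)) + 1

n = 200
# Hypothetical candidates: name -> (maximized log-likelihood, parameter count).
candidates = {"small": (-320.5, 3), "large": (-317.0, 5)}

scores = {name: criteria(ll, k, n) for name, (ll, k) in candidates.items()}
best_aic = min(scores, key=lambda name: scores[name][0])
best_bic = min(scores, key=lambda name: scores[name][1])

print("smallest n with ln(n) > 2:", smallest_n)
print("AIC prefers:", best_aic)
print("BIC prefers:", best_bic)
```

In this constructed case the larger model gains 3.5 log-likelihood units for two extra parameters. That gain is enough to offset AIC's penalty of 2 per parameter but not BIC's penalty of ln(200) ≈ 5.3 per parameter, so the two criteria pick different models.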

Two critical caveats prevent misuse. First, AIC and BIC can only be compared across models fit to the same observations with the same response variable. Comparing AIC from a model of log(Y) to one of Y is invalid as computed, because the likelihoods live on different scales. Second, a lower AIC or BIC means only that one model is relatively better than another; it says nothing about whether either model fits well in an absolute sense. A model with AIC = 500 may be far better than one with AIC = 600, yet both may be terrible. Information criteria are selection tools, not validation tools, so always pair them with residual diagnostics and substantive scrutiny of the winning model.

