Questions: Information Criteria: AIC and BIC for Model Selection
5 questions to test your understanding
Question 1 Multiple Choice
A researcher estimates Model A on one dataset and gets AIC = −150. She estimates Model B on a different dataset and gets AIC = −200. She concludes Model B fits its data better. What is wrong with this reasoning?
A. AIC can only be used when models are nested; non-nested models require a different criterion
B. She should compare BIC values instead of AIC for cross-dataset comparisons
C. AIC values are not comparable across different datasets; only differences between models estimated on the same data are meaningful
D. A lower AIC always indicates worse fit, so Model A is actually the better model
Answer: C
AIC values have no meaningful absolute interpretation. Their scale depends on the dataset, the likelihood function, and the number of observations. Comparing AIC = −150 from one dataset to AIC = −200 from another is like comparing scores from two different exams with different grading scales. Only *differences* in AIC between models fitted to exactly the same dataset can guide model selection. This is stated explicitly in the Common Misconceptions section and is perhaps the most frequent misuse of information criteria in practice.
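One way to see that the level of AIC is arbitrary is to change only the units of the dependent variable. The sketch below is a minimal illustration, not part of the original quiz: the simulated data, the helper `gaussian_aic`, and the two candidate regressions are all assumptions made for the example. It fits the same two Gaussian linear models to `y` and to `100 * y`; every AIC value shifts by the same constant, while the difference between the two models stays the same.

```python
import numpy as np

def gaussian_aic(y, X):
    """AIC = 2k - 2 ln L for a least-squares fit with Gaussian errors."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n               # ML estimate of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = X.shape[1] + 1                       # coefficients plus the variance
    return 2 * k - 2 * loglik

# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=n)

X_small = np.column_stack([np.ones(n), x])        # intercept + x
X_big = np.column_stack([np.ones(n), x, x**2])    # adds a quadratic term

# Same models, same observations, but y recorded in different units.
for scale in (1.0, 100.0):
    a_small = gaussian_aic(scale * y, X_small)
    a_big = gaussian_aic(scale * y, X_big)
    print(f"scale={scale:>5}: AIC_small={a_small:9.2f}  "
          f"AIC_big={a_big:9.2f}  diff={a_big - a_small:6.2f}")
```

Rescaling `y` by a factor c adds 2n ln(c) to every AIC value in this setup, so the absolute numbers carry no information about fit quality. Only the gap between models on the same data does.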
Question 2 Multiple Choice
You are comparing five regression models on the same dataset. Their AIC values are −410, −408, −397, −385, and −420. Using the standard rule of thumb (|ΔAIC| ≤ 2 suggests equivalent models, |ΔAIC| > 10 suggests strong evidence), which pair of models is effectively equivalent?
A. AIC = −410 and AIC = −408 (difference = 2)
B. AIC = −408 and AIC = −397 (difference = 11)
C. AIC = −397 and AIC = −385 (difference = 12)
D. AIC = −420 and AIC = −410 (difference = 10)
Answer: A
By the rule of thumb, a small AIC difference (about 2 or less) indicates that the models are roughly equivalent in their fit-complexity tradeoff, so neither is clearly preferred. The pair with AIC = −410 and −408 differs by only 2, the smallest gap among the options and right at the edge of equivalence. The pairs with differences of 11 and 12 exceed 10 and count as strong evidence in favor of the lower-AIC model, while the difference of exactly 10 sits at that threshold. Note that the model with AIC = −420 is the best overall: among models fitted to the same data, the lower (more negative) AIC is preferred.
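The arithmetic can be checked with a few lines of Python. This is only a sketch: the labels `M1` to `M5` are hypothetical names for the five models in the question. It also reports each model's gap from the lowest-AIC model, which is how the rule of thumb is often applied in practice.

```python
# AIC values from the question; the model labels are hypothetical.
aic = {"M1": -410.0, "M2": -408.0, "M3": -397.0, "M4": -385.0, "M5": -420.0}

# The preferred model is the one with the lowest AIC.
best = min(aic, key=aic.get)

# Delta AIC of every model relative to the best one.
delta = {name: value - aic[best] for name, value in aic.items()}

for name, d in sorted(delta.items(), key=lambda item: item[1]):
    print(f"{name}: AIC={aic[name]:7.1f}  delta_vs_best={d:5.1f}")

# Pairwise gap between the two models named in option A.
gap_a = abs(aic["M1"] - aic["M2"])
print("gap between M1 and M2:", gap_a)
```

Measured against the best model, every other candidate is at least 10 AIC units behind, so the near-equivalence in option A is between two models that both trail the model with AIC = −420.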
Question 3 True / False
BIC penalizes each additional parameter more heavily than AIC whenever the sample contains 8 or more observations, because its complexity penalty grows with the logarithm of sample size.
T. True
F. False
Answer: True
AIC's penalty for each additional parameter is a fixed 2 regardless of sample size. BIC's penalty is ln(n) per parameter, which exceeds 2 once n > e² ≈ 7.4, that is, from n = 8 onward. For typical econometric datasets with hundreds or thousands of observations, BIC imposes roughly 2 to 4 times the per-parameter penalty of AIC (ln(100) ≈ 4.6, ln(1000) ≈ 6.9). This is why BIC tends to select sparser models than AIC in large samples, and why the two criteria can increasingly disagree as sample size grows.
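The crossover point and the size of the gap can be tabulated directly. The sketch below uses only the two per-parameter penalties described above; the function name `penalty_per_parameter` and the chosen sample sizes are illustrative assumptions.

```python
import math

def penalty_per_parameter(n):
    """Return (AIC penalty, BIC penalty) per extra parameter for sample size n."""
    return 2.0, math.log(n)

for n in (5, 7, 8, 100, 1000, 10000):
    aic_pen, bic_pen = penalty_per_parameter(n)
    print(f"n={n:>6}: AIC={aic_pen:.2f}  BIC={bic_pen:.2f}  "
          f"ratio={bic_pen / aic_pen:.2f}")

# BIC's penalty exceeds AIC's once n is above e^2.
crossover = math.exp(2)
print(f"crossover at n = {crossover:.2f}")
```

The ratio grows only logarithmically, so even a tenfold increase in sample size adds about 1.15 to the ratio rather than multiplying it.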
Question 4 True / False
A model with AIC = −400 is preferable to one with AIC = −200, regardless of which dataset each was estimated on.
T. True
F. False
Answer: False
AIC values from different datasets cannot be compared. The absolute value of AIC depends on the number of observations, the scale of the likelihood, and the distributional assumptions — none of which are held constant across different datasets. Only within-dataset comparisons are meaningful. This is perhaps the most common misuse of information criteria: treating AIC as an absolute measure of model quality rather than a relative tool for comparing models on the same data.
Question 5 Short Answer
A researcher is building a model to predict next quarter's GDP growth and wants to select among several specifications. A colleague is trying to identify which macroeconomic variables are 'truly' causal drivers of growth. Should they use the same criterion (AIC or BIC)? Explain why or why not.
Model answer: No. The prediction-focused researcher should use AIC, which is calibrated to minimize out-of-sample prediction error and tolerates slightly more complex models. The causal-identification researcher should use BIC, which under certain conditions selects the 'true' model (the data-generating process) with higher probability, especially in large samples, because its heavier penalty avoids including spurious variables. The goals differ: minimizing forecast error versus identifying the correct structural relationships.
AIC and BIC optimize for different things. AIC estimates, up to a constant, the expected Kullback-Leibler divergence between the fitted model and the true data-generating process, so choosing the lowest AIC targets predictive accuracy. BIC is derived from a large-sample approximation to Bayesian model selection and, under regularity conditions, is consistent: as sample size grows, BIC selects the true model (if it is in the candidate set) with probability approaching 1. That consistency concerns recovering the correct specification among the candidates; it does not by itself establish that the selected variables are causal. When the two criteria disagree, the choice should reflect the researcher's actual goal, not a default preference.
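A small numerical case shows how the two criteria can rank the same pair of models differently. The sketch below uses the standard definitions AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L. The sample size, log-likelihoods, and parameter counts are hypothetical values chosen for illustration, not results from any real dataset.

```python
import math

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return k * math.log(n) - 2 * loglik

n = 200  # hypothetical sample size

# Hypothetical (maximized log-likelihood, number of parameters) per model.
candidates = {"small": (-300.0, 3), "large": (-296.0, 5)}

aic_scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
bic_scores = {name: bic(ll, k, n) for name, (ll, k) in candidates.items()}

# Each criterion prefers the model with the lowest score.
aic_choice = min(aic_scores, key=aic_scores.get)
bic_choice = min(bic_scores, key=bic_scores.get)

print("AIC scores:", aic_scores, "->", aic_choice)
print("BIC scores:", bic_scores, "->", bic_choice)
```

With these numbers the two extra parameters improve the log-likelihood by 4. That is enough to offset AIC's penalty of 2 per parameter but not BIC's penalty of ln(200) ≈ 5.3 per parameter, so AIC prefers the larger model and BIC the smaller one.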