In MDL, the total description length is L(M) + L(D|M). Why is it insufficient to minimize only L(D|M) (the fit) when selecting a model?
ABecause L(D|M) is always zero for complex models
BBecause minimizing only L(D|M) leads to overfitting — every training set can be perfectly memorized by a sufficiently complex model (e.g., a lookup table), which explains the data with L(D|M) = 0 but does not generalize. MDL penalizes complexity through L(M), forcing a tradeoff.
CBecause L(D|M) cannot be computed
DBecause the data length is fixed and does not vary across models
A model that memorizes all N training examples (e.g., storing (x_i, y_i) pairs) can achieve L(D|M) ≈ 0 — perfect fit. But such a model has no predictive power on new data. MDL enforces a tradeoff: the model description L(M) would be huge (it must encode all the lookups), so the total L(M) + L(D|M) is large. A simpler model might have nonzero L(D|M) on the training set (it doesn't fit perfectly), but smaller L(M), yielding lower total description length and better generalization. This is why MDL is a principled solution to model selection: it automatically prevents overfitting without needing a separate validation set.
Question 2 True / False
MDL is asymptotically equivalent to the Bayesian information criterion (BIC), which penalizes model complexity using n*log(k) where n is the sample size and k is the model's degrees of freedom.
TTrue
FFalse
Answer: True
The connection is deep. BIC = -2*log-likelihood(D|M) + k*log(n), which penalizes model complexity logarithmically in the sample size. For large n, BIC is asymptotically equivalent to MDL when L(M) is taken as the (log) prior on model complexity, and L(D|M) is the (log) likelihood. Both methods balance fit and complexity in a way that is asymptotically optimal for selecting among a finite set of models. MDL generalizes to non-asymptotic settings and to models where the number of parameters is not fixed.
Question 3 Short Answer
Explain how MDL connects to Kolmogorov complexity and Solomonoff induction. Why does MDL approximate the theoretical optimum for learning?
Think about your answer, then reveal below.
Model answer: Kolmogorov complexity K(x) measures the shortest program that generates x. The universal prior 2^(-K(x)) assigns probability to each data sequence inversely proportional to the shortest program length. Solomonoff induction predicts by averaging over all programs consistent with observed data, weighted by this prior — it is theoretically optimal for any computable source (converges to the true distribution). MDL approximates Solomonoff induction by using practical compression schemes (e.g., arithmetic coding) in place of true K(x), which is uncomputable. The description length L(D|M) via arithmetic coding approximates I(D|M) in bits, and choosing M to minimize L(M) + L(D|M) approximates the Solomonoff prior. In this sense, MDL is a computable, practical approximation to theoretically optimal Bayesian prediction using the universal prior.
MDL makes the abstract theory of algorithmic information concrete: instead of computing K(x) (uncomputable), we use compression algorithms; instead of computing the universal prior 2^(-K(x)) (uncomputable), we optimize description length directly. The result is a practical principle that inherits the theoretical optimality of Solomonoff induction while remaining implementable.
Question 4 Multiple Choice
A dataset D of 100 samples is fit with two models: (A) a 2-parameter linear regression with total description L_A = 100 bits, (B) a 10-parameter polynomial regression with total description L_B = 150 bits. Which model does MDL prefer, and why?
AModel B, because it has more parameters and higher capacity
BModel A, because it has lower total description length (100 < 150)
CModel B, because the sample size is 100 and polynomial has better expressiveness
DThe choice depends on the likelihood values, not the description length
MDL directly compares total description lengths: L_A = 100 < L_B = 150, so MDL selects Model A. The total description includes both the model specification (L(M)) and the residuals given the model (L(D|M)). Model B's higher-dimensional parameter space requires more bits to specify, and though it might fit the data more closely (lower L(D|M)), the savings in data description do not outweigh the cost of the model description. This is how MDL automatically prevents overfitting: extra parameters must justify themselves by achieving significant compression of the data.