A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Minimum Description Length

Research Depth 98 in the knowledge graph ☐ I know this ☆ Set as goal

2topics build on this

628prerequisites beneath it

Kolmogorov Complexity and Algorithmic Information Theory Source Coding Theorem +1 more→→Information Theory and Statistical Inference

Core Idea

The Minimum Description Length (MDL) principle is a practical formalization of Occam's razor: choose the model that minimizes the total description length of the model plus the data encoded relative to that model. MDL = L(M) + L(D|M), where L(M) is the length of the model description and L(D|M) is the length of the data description given the model. This provides a principled model selection criterion that automatically trades off fit and complexity — more complex models can explain data in fewer bits (lower L(D|M)), but require longer descriptions themselves (higher L(M)). MDL is motivated by Kolmogorov complexity and Solomonoff induction (the universal prior 2^(-K(x)) assigns exponentially higher weight to simpler explanations), and is related to the Bayesian information criterion (BIC) and information-theoretic measures like AIC. Unlike likelihood-based approaches, MDL is not asymptotic — it applies to finite samples — and avoids overfitting by penalizing model complexity directly.

Explainer

Modern machine learning faces a fundamental challenge: overfitting. A sufficiently complex model can memorize any training data, achieving zero training error while failing on new data. How do you choose model complexity automatically?

The Minimum Description Length principle answers this by formalizing Occam's razor as a compression problem. The description of data D given model M consists of two parts:

1. Model description L(M): How many bits to describe the model itself (its parameters, structure, hyperparameters)?

2. Data description L(D|M): How many bits to describe the data, compressed using the model?

The total is MDL = L(M) + L(D|M). The principle: choose the model that minimizes this total. This balances two competing objectives. A simple model has low L(M) but may not fit the data well, requiring many bits to describe the residuals (high L(D|M)). A complex model has high L(M) but can fit the data closely (low L(D|M)). The optimal model trades off these two costs.

Practical Implementation:

L(M) = cost to encode the model parameters (often proportional to the number of parameters k, scaled by log of precision needed, e.g., k*log(n) bits for n samples).
L(D|M) = bits needed to encode the prediction errors given the model, often via arithmetic coding or similar compression.

For a dataset with n samples, a 2-parameter linear model might require 10 bits to specify parameters precisely, plus 1000 bits to encode residuals (total 1010). A 10-parameter polynomial might require 50 bits for parameters, plus 800 bits for residuals (total 850). MDL chooses the polynomial. As the polynomial's complexity grows further, L(M) increases rapidly. At some point, the savings in L(D|M) do not compensate — MDL stops and selects the best model.

Connection to Bayesian Methods:

MDL is closely related to the Bayesian information criterion (BIC) and Bayesian model selection. A Bayesian assigns a prior to each model and computes the posterior given data. The model with the highest posterior is selected. When the prior is a uniform distribution over models, and we use a data likelihood that corresponds to a compression scheme, MDL emerges naturally. More formally, Bayesian model selection under the "universal prior" 2^(-K(M)) (from Kolmogorov complexity) is equivalent to MDL.

Comparison to Information Criteria:

AIC (Akaike Information Criterion) = -2*log-likelihood + 2k. Penalizes complexity by the number of parameters, without sample size dependence. Tends to overfit for finite samples.
BIC = -2*log-likelihood + k*log(n). Penalizes complexity scaled by sample size. Asymptotically equivalent to MDL for large n.
MDL = L(M) + L(D|M). Directly minimizes description length, applies to finite samples, and is not restricted to parametric models.

Advantages:

Avoids overfitting without needing a separate validation set.
Applies to non-parametric models (decision trees, neural networks) where the number of "parameters" is ambiguous.
Theoretically justified through Kolmogorov complexity and algorithmic probability.
Naturally handles model comparison, feature selection, and architecture search.

Limitations:

Requires computing or approximating L(D|M), which depends on the choice of compression scheme (Huffman, arithmetic coding). Different schemes give slightly different results.
Computationally expensive for some models (requires iterating to find optimal parameter encodings).
The "true" MDL involves uncomputable Kolmogorov complexity; practical MDL uses approximations.

The principle has been successfully applied to feature selection, model architecture search, clustering, and pattern discovery. It provides a unified framework for learning that, unlike empirical risk minimization, directly encodes the philosophical principle that simpler explanations are preferable — a principle with deep roots in information theory and mathematical foundations of science.

Practice Questions 4 questions