The Minimum Description Length (MDL) principle is a practical formalization of Occam's razor: choose the model that minimizes the total description length of the model plus the data encoded relative to that model. MDL = L(M) + L(D|M), where L(M) is the length of the model description and L(D|M) is the length of the data description given the model. This provides a principled model selection criterion that automatically trades off fit and complexity — more complex models can explain data in fewer bits (lower L(D|M)), but require longer descriptions themselves (higher L(M)). MDL is motivated by Kolmogorov complexity and Solomonoff induction (the universal prior 2^(-K(x)) assigns exponentially higher weight to simpler explanations), and is related to the Bayesian information criterion (BIC) and information-theoretic measures like AIC. Unlike likelihood-based approaches, MDL is not asymptotic — it applies to finite samples — and avoids overfitting by penalizing model complexity directly.
Modern machine learning faces a fundamental challenge: overfitting. A sufficiently complex model can memorize any training data, achieving zero training error while failing on new data. How do you choose model complexity automatically?
The Minimum Description Length principle answers this by formalizing Occam's razor as a compression problem. The description of data D given model M consists of two parts:
1. Model description L(M): How many bits to describe the model itself (its parameters, structure, hyperparameters)?
2. Data description L(D|M): How many bits to describe the data, compressed using the model?
The total is MDL = L(M) + L(D|M). The principle: choose the model that minimizes this total. This balances two competing objectives. A simple model has low L(M) but may not fit the data well, requiring many bits to describe the residuals (high L(D|M)). A complex model has high L(M) but can fit the data closely (low L(D|M)). The optimal model trades off these two costs.
Practical Implementation:
For a dataset with n samples, a 2-parameter linear model might require 10 bits to specify parameters precisely, plus 1000 bits to encode residuals (total 1010). A 10-parameter polynomial might require 50 bits for parameters, plus 800 bits for residuals (total 850). MDL chooses the polynomial. As the polynomial's complexity grows further, L(M) increases rapidly. At some point, the savings in L(D|M) do not compensate — MDL stops and selects the best model.
Connection to Bayesian Methods:
MDL is closely related to the Bayesian information criterion (BIC) and Bayesian model selection. A Bayesian assigns a prior to each model and computes the posterior given data. The model with the highest posterior is selected. When the prior is a uniform distribution over models, and we use a data likelihood that corresponds to a compression scheme, MDL emerges naturally. More formally, Bayesian model selection under the "universal prior" 2^(-K(M)) (from Kolmogorov complexity) is equivalent to MDL.
Comparison to Information Criteria:
Advantages:
Limitations:
The principle has been successfully applied to feature selection, model architecture search, clustering, and pattern discovery. It provides a unified framework for learning that, unlike empirical risk minimization, directly encodes the philosophical principle that simpler explanations are preferable — a principle with deep roots in information theory and mathematical foundations of science.