Minimum Description Length

Research Depth 78 in the knowledge graph I know this Set as goal
Unlocks 2 downstream topics
MDL model selection description length compression Occam's razor algorithmic probability

Core Idea

The Minimum Description Length (MDL) principle is a practical formalization of Occam's razor: choose the model that minimizes the total description length of the model plus the data encoded relative to that model. MDL = L(M) + L(D|M), where L(M) is the length of the model description and L(D|M) is the length of the data description given the model. This provides a principled model selection criterion that automatically trades off fit and complexity — more complex models can explain data in fewer bits (lower L(D|M)), but require longer descriptions themselves (higher L(M)). MDL is motivated by Kolmogorov complexity and Solomonoff induction (the universal prior 2^(-K(x)) assigns exponentially higher weight to simpler explanations), and is related to the Bayesian information criterion (BIC) and information-theoretic measures like AIC. Unlike likelihood-based approaches, MDL is not asymptotic — it applies to finite samples — and avoids overfitting by penalizing model complexity directly.

Explainer

Modern machine learning faces a fundamental challenge: overfitting. A sufficiently complex model can memorize any training data, achieving zero training error while failing on new data. How do you choose model complexity automatically?

The Minimum Description Length principle answers this by formalizing Occam's razor as a compression problem. The description of data D given model M consists of two parts:

1. Model description L(M): How many bits to describe the model itself (its parameters, structure, hyperparameters)?

2. Data description L(D|M): How many bits to describe the data, compressed using the model?

The total is MDL = L(M) + L(D|M). The principle: choose the model that minimizes this total. This balances two competing objectives. A simple model has low L(M) but may not fit the data well, requiring many bits to describe the residuals (high L(D|M)). A complex model has high L(M) but can fit the data closely (low L(D|M)). The optimal model trades off these two costs.

Practical Implementation:

For a dataset with n samples, a 2-parameter linear model might require 10 bits to specify parameters precisely, plus 1000 bits to encode residuals (total 1010). A 10-parameter polynomial might require 50 bits for parameters, plus 800 bits for residuals (total 850). MDL chooses the polynomial. As the polynomial's complexity grows further, L(M) increases rapidly. At some point, the savings in L(D|M) do not compensate — MDL stops and selects the best model.

Connection to Bayesian Methods:

MDL is closely related to the Bayesian information criterion (BIC) and Bayesian model selection. A Bayesian assigns a prior to each model and computes the posterior given data. The model with the highest posterior is selected. When the prior is a uniform distribution over models, and we use a data likelihood that corresponds to a compression scheme, MDL emerges naturally. More formally, Bayesian model selection under the "universal prior" 2^(-K(M)) (from Kolmogorov complexity) is equivalent to MDL.

Comparison to Information Criteria:

Advantages:

Limitations:

The principle has been successfully applied to feature selection, model architecture search, clustering, and pattern discovery. It provides a unified framework for learning that, unlike empirical risk minimization, directly encodes the philosophical principle that simpler explanations are preferable — a principle with deep roots in information theory and mathematical foundations of science.

Practice Questions 4 questions

Prerequisite Chain

Counting to 10Counting to 20Understanding ZeroThe Number ZeroCounting to FiveOne-to-One CorrespondenceCombining Small Groups Within 5Addition Within 10Addition Within 20Two-Digit Addition Without RegroupingTwo-Digit Addition with RegroupingAddition Within 100Repeated Addition as MultiplicationMultiplication Facts Within 100Division as Equal SharingDivision as Grouping (Measurement Division)Division: Grouping (Repeated Subtraction) ModelDivision: Fair Sharing ModelDivision as Equal SharingDivision as GroupingBasic Division FactsDivision Facts Within 100Two-Digit by One-Digit DivisionDivision with RemaindersRemainders and Quotients in DivisionDivision Word ProblemsIntroduction to Long DivisionFactors and MultiplesPrime and Composite NumbersEquivalent FractionsRelating Fractions and DecimalsDecimal Place ValueReading and Writing DecimalsComparing and Ordering DecimalsAdding and Subtracting DecimalsMultiplying DecimalsDividing DecimalsDividing FractionsMixed Number ArithmeticOrder of OperationsInteger Order of OperationsVariable ExpressionsCombining Like TermsOne-Step EquationsTwo-Step EquationsSolving Multi-Step EquationsEquations with Variables on Both SidesAngle Pairs: Complementary, Supplementary, and VerticalParallel Lines and TransversalsCorresponding AnglesAlternate Interior AnglesTriangle Angle Sum TheoremExterior Angle TheoremTriangle Inequality TheoremSimilar Triangles: AA SimilaritySimilar Triangles: SSS and SAS SimilarityProportions in Similar TrianglesRight Triangle Trigonometry IntroductionTrigonometric Ratios ReviewRadian MeasureConverting Between Degrees and RadiansThe Unit CircleGraphing Sine and CosineGraphing Tangent and Reciprocal Trigonometric FunctionsDerivatives of Trigonometric FunctionsAntiderivativesIndefinite IntegralsBasic Integration RulesRiemann SumsDefinite Integral DefinitionProbability Density Functions and Continuous DistributionsCumulative Distribution FunctionsContinuous Random VariablesProbability Density FunctionsShannon EntropyJoint and Conditional EntropyMutual InformationKL DivergenceMinimum Description Length

Longest path: 79 steps · 480 total prerequisite topics

Prerequisites (3)

Leads To (1)