Neural Scaling Laws

scaling-laws deep-learning sample-complexity compute neural-networks

Core Idea

Neural scaling laws describe how neural network performance improves predictably with three factors: model size (parameters), training data size (samples), and compute budget (FLOPs). Empirically, performance follows power-law relationships: loss scales as O(N^{-alpha}) where N is the factor being scaled and alpha is typically 0.07-0.1. These laws are striking because they hold across diverse architectures (transformers, CNNs, RNNs), domains (vision, language, multimodal), and scales (millions to billions of parameters). Scaling laws enable predicting performance before training, allocating compute efficiently between model size and data, and understanding fundamental limits of deep learning.

Explainer

Neural scaling laws, documented extensively by Kaplan et al. (2020) at OpenAI, Hoffmann et al. (2022) at DeepMind, and subsequent work, reveal that deep learning performance is not haphazard but follows predictable mathematical relationships. The primary finding is that loss decreases as a power law in three factors: model size (N), data size (D), and compute budget (C).

The scaling laws are typically expressed as:

L(N) ∝ N^{-alpha_N},   L(D) ∝ D^{-alpha_D},   L(C) ∝ C^{-alpha_C}

where alpha_N ≈ 0.076, alpha_D ≈ 0.095, and alpha_C ≈ 0.05 are the values Kaplan et al. (2020) report for language model pretraining. The power-law form is remarkably consistent across different architectures and domains, even though the exponent values themselves vary, suggesting it reflects fundamental properties of learning from data.
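To make the size of these exponents concrete, here is a minimal sketch that assumes a pure power law L(x) ∝ x^{-alpha} with no irreducible loss term. The function names are illustrative, and the exponents 0.07 and 0.10 are simply round values in the range quoted in this article, not fitted results.

```python
def loss_ratio(scale_factor: float, alpha: float) -> float:
    """Ratio L(k * x) / L(x) under a pure power law L(x) ∝ x**(-alpha)."""
    return scale_factor ** (-alpha)


def scale_needed_to_halve_loss(alpha: float) -> float:
    """Factor k with L(k * x) = L(x) / 2, which solves to k = 2**(1 / alpha)."""
    return 2.0 ** (1.0 / alpha)


# Illustrative exponents only (assumption: pure power law, no loss floor).
for alpha in (0.07, 0.10):
    drop = 1.0 - loss_ratio(10.0, alpha)
    k_half = scale_needed_to_halve_loss(alpha)
    print(f"alpha={alpha:.2f}: 10x scale cuts loss by {drop:.1%}; "
          f"halving the loss needs about {k_half:.2e}x scale")
```

Under these assumptions a tenfold increase in the scaled factor lowers the loss by only about 15 to 21 percent, and halving the loss at alpha = 0.10 takes roughly a thousandfold increase, which is why small changes in the exponent matter so much in practice.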

A key result is the Chinchilla analysis of Hoffmann et al. (2022), which showed that for a fixed compute budget, performance is best when model size and the number of training tokens are scaled in roughly equal proportion. This overturned the previous practice of scaling model size much more aggressively than data size. The implication: rather than training a model with 175B parameters on 300B tokens, train a smaller model on far more data, as Chinchilla did with ~70B parameters and about 1.4T tokens. This principle has guided subsequent model development and explains why later competitive models are trained on many more tokens per parameter.
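The allocation idea can be sketched numerically. The snippet below rests on two assumptions that are not stated in this article: the common approximation that training compute is C ≈ 6·N·D FLOPs, and the rule of thumb of about 20 training tokens per parameter often associated with the Chinchilla result. The function name and constants are hypothetical, chosen for illustration.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: training compute C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumption: rule-of-thumb ratio D ≈ 20 * N


def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Return (n_params, n_tokens) satisfying C = 6*N*D and D = 20*N."""
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# Compute budget of a 175B-parameter model trained on 300B tokens.
budget = FLOPS_PER_PARAM_TOKEN * 175e9 * 300e9
n_params, n_tokens = compute_optimal_split(budget)
print(f"budget   : {budget:.2e} FLOPs")
print(f"params   : {n_params / 1e9:.0f}B")
print(f"tokens   : {n_tokens / 1e12:.2f}T")
```

With these assumed constants, the same budget is spent on a model of roughly 50B parameters trained on about 1T tokens. The exact numbers depend on the assumed ratio, so treat this as a back-of-the-envelope estimate, not a reproduction of the paper's fitted allocation.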

Theoretically, understanding scaling laws remains incomplete. Several frameworks provide partial explanations:

1. Statistical learning theory: Generalization bounds scale with model capacity and data size, consistent with power-law scaling in the overparameterized regime.

2. Renormalization group theory: Some researchers draw parallels to phase transitions and critical phenomena in physics, where observables scale as power laws near criticality.

3. Information-theoretic bounds: Bounds on mutual information between data and model parameters suggest power-law scaling of required samples.

4. Benign overfitting: In the overparameterized regime, models can achieve zero training error while generalizing, enabled by implicit regularization that suppresses memorization of noise.

However, none of these fully explains why the exponents take the values they do or why the power-law form is so consistent across domains. The mechanism by which neural networks extract structure from data with such efficiency remains partially mysterious.

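Practically, scaling laws enable several capabilities: predicting performance before training, allocating compute efficiently between model size and data, and understanding the fundamental limits of deep learning.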
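Predicting performance before training usually means fitting a power law to a few small runs and extrapolating. The sketch below assumes a pure power law loss ≈ a·x^{-alpha}, fitted by ordinary least squares in log-log space. The measurements are synthetic, generated from made-up parameters, and the helper names are hypothetical.

```python
import numpy as np


def fit_power_law(x, loss):
    """Fit loss ≈ a * x**(-alpha) by a straight-line fit in log-log space."""
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_loss(x, a, alpha):
    """Evaluate the fitted power law at new scale(s) x."""
    return a * np.asarray(x, dtype=float) ** (-alpha)


# Synthetic "small run" measurements (assumed values, not real data).
rng = np.random.default_rng(0)
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
true_a, true_alpha = 12.0, 0.07
noise = np.exp(rng.normal(0.0, 0.01, sizes.size))  # ~1% multiplicative noise
losses = true_a * sizes ** (-true_alpha) * noise

a_hat, alpha_hat = fit_power_law(sizes, losses)
print(f"fitted a = {a_hat:.2f}, alpha = {alpha_hat:.3f}")
print(f"extrapolated loss at 1e9: {float(predict_loss(1e9, a_hat, alpha_hat)):.3f}")
```

Fitting in log-log space turns the power law into a straight line, so the slope gives the exponent directly. The extrapolation is only as trustworthy as the assumption that the same regime continues beyond the fitted range.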

Limitations include: scaling laws may break down at very large scales (double descent or other regime changes); theoretical accounts built on distribution-independent worst-case complexity miss the structure present in real data; domain-specific scaling exponents may differ from those of language models; and the laws do not account for inference cost, interpretability, or other downstream considerations.

The discovery of neural scaling laws is among the most important recent insights in deep learning, bridging empirical machine learning practice with theoretical understanding and enabling principled resource allocation for training increasingly capable models.
