A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Neural Scaling Laws


Core Idea

Neural scaling laws describe how neural network performance improves predictably with three factors: model size (parameters), training data size (samples or tokens), and compute budget (FLOPs). Empirically, when the other factors are not the bottleneck, loss follows a power law: it scales as N^-alpha, where N is the factor being scaled and alpha is small, roughly 0.05-0.1 for language models. These laws are striking because the power-law form has been observed across architectures (transformers, CNNs, RNNs), domains (vision, language, multimodal), and many orders of magnitude of scale, although the fitted exponents vary with domain and setup. Scaling laws make it possible to predict performance before training, allocate compute efficiently between model size and data, and probe the limits of deep learning.

Explainer

Neural scaling laws, documented most prominently by Kaplan et al. (2020) at OpenAI and Hoffmann et al. (2022) at DeepMind, along with subsequent work, show that deep learning performance is not haphazard but follows predictable mathematical relationships. The primary finding is that loss decreases as a power law in three factors: model size (N), data size (D), and compute budget (C).

The scaling laws are typically expressed as:

L(N) ≈ a_N * N^-alpha_N
L(D) ≈ a_D * D^-alpha_D
L(C) ≈ a_C * C^-alpha_C

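where alpha_N ≈ 0.076, alpha_D ≈ 0.095, and alpha_C ≈ 0.050 (the values Kaplan et al. report for language model pretraining, with alpha_C measured against compute that is allocated optimally). The compute exponent is the smallest of the three because compute grows with both model size and data, roughly C ≈ 6ND. The power-law form recurs across architectures and domains, though the fitted exponents differ from one setting to another, which suggests the form reflects general properties of learning from data.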
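To get a feel for what such small exponents mean, the sketch below evaluates a pure power law L(N) = a * N^-alpha. The function names are illustrative assumptions, not taken from any library, and the law is assumed to hold exactly with no irreducible loss floor.

```python
def loss_ratio(scale_factor, alpha):
    """Factor by which loss is multiplied when a resource grows by scale_factor.

    Assumes a pure power law L = a * N**(-alpha), so the constant a cancels.
    """
    return scale_factor ** (-alpha)


def scale_needed(target_ratio, alpha):
    """Resource multiplier needed to bring loss down to target_ratio of its current value."""
    return target_ratio ** (-1.0 / alpha)


# Illustrative exponent: alpha = 0.07, the model-size value quoted above.
ratio_10x = loss_ratio(10.0, 0.07)        # effect of a 10x larger model
growth_to_halve = scale_needed(0.5, 0.07)  # growth needed to halve the loss
```

With alpha = 0.07, a 10x larger model lowers loss by only about 15%, and halving the loss would take roughly a 20,000-fold increase. Gains from scale are predictable, but they are expensive.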

A key result is the Chinchilla finding from Hoffmann et al. (2022): for a fixed compute budget, the best loss comes from scaling model size and the number of training tokens in roughly equal proportion, which works out to about 20 training tokens per parameter. This overturned the earlier practice of scaling model size much more aggressively than data size. The implication: rather than training a 175B-parameter model on 300B tokens, it is better to train a smaller model on more tokens; Chinchilla itself has 70B parameters and was trained on about 1.4 trillion tokens. This principle has guided subsequent model development and explains why later models are trained on many more tokens per parameter.
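The allocation can be made concrete with a small calculation. The sketch below rests on two assumptions that are not stated in the text above: the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6ND), and the rule of thumb of about 20 tokens per parameter often associated with Hoffmann et al. The function name and its defaults are illustrative.

```python
import math


def compute_optimal_allocation(compute_flops,
                               tokens_per_param=20.0,
                               flops_per_param_token=6.0):
    """Split a FLOP budget into (parameters, tokens).

    Assumptions (rules of thumb, not exact laws):
      C ~= flops_per_param_token * N * D
      D ~= tokens_per_param * N
    Substituting the second into the first gives
      N = sqrt(C / (flops_per_param_token * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget equal to 6 * 70e9 parameters * 1.4e12 tokens.
n_params, n_tokens = compute_optimal_allocation(5.88e23)
```

For a budget of 5.88e23 FLOPs, this returns about 7e10 parameters and 1.4e12 tokens. Changing `tokens_per_param` shows how sensitive the split is to that one assumed ratio.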

Theoretically, understanding scaling laws remains incomplete. Several frameworks provide partial explanations:

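1. Statistical learning theory: Classical generalization bounds shrink polynomially with data size for a given model capacity, which is qualitatively consistent with power-law scaling, although such bounds are often loose or vacuous in the overparameterized regime.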

2. Renormalization group theory: Some researchers draw parallels to phase transitions and critical phenomena in physics, where observables scale as power laws near criticality.

3. Information-theoretic bounds: Bounds on mutual information between data and model parameters suggest power-law scaling of required samples.

4. Benign overfitting: In the overparameterized regime, models can reach zero training error, fitting even noisy labels, while still generalizing, because implicit regularization limits how much the fitted noise harms test performance.

However, none of these fully explains the specific values of the exponents or why the power-law form persists over so many orders of magnitude and across domains. The mechanism by which neural networks extract structure from data so efficiently remains only partly understood.

Practically, scaling laws enable several capabilities:

Compute-optimal allocation: Given a fixed budget, determine the best balance of model size and data size.
Loss prediction: Fit scaling law curves to small models and predict the loss of larger models before training.
Chinchilla scaling: Train models with balanced model-to-data ratios rather than extreme imbalances.
Efficiency analysis: Understand which resources (compute, data, model size) provide the best return.
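Loss prediction from small runs can be sketched as a straight-line fit in log-log space, since log L = log a - alpha * log N for a pure power law. The example below is a minimal illustration on synthetic, noise-free data generated from a known law; the function names and the constants (a = 10, alpha = 0.07) are made up for the demonstration, and real measurements would be noisy and may need an irreducible-loss term.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log L = log a - alpha * log N. Returns (a, alpha)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


def predict_loss(a, alpha, size):
    """Evaluate the fitted law L = a * N**(-alpha) at a new size."""
    return a * size ** (-alpha)


# Synthetic "small model" measurements from a known law (not real data).
true_a, true_alpha = 10.0, 0.07
small_sizes = [1e6, 3e6, 1e7, 3e7, 1e8]
small_losses = [true_a * n ** (-true_alpha) for n in small_sizes]

a_hat, alpha_hat = fit_power_law(small_sizes, small_losses)
predicted_large = predict_loss(a_hat, alpha_hat, 1e9)  # extrapolate 10x beyond the data
```

Because the synthetic data follow the law exactly, the fit recovers the exponent and the extrapolation matches the true curve. With real training runs, the quality of the extrapolation depends on how well the power-law form holds beyond the fitted range.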

Limitations include: scaling laws are empirical fits and may break down outside the range they were fitted on (for example near double descent or other regime changes); theoretical accounts built on distribution-independent worst-case complexity miss the structure present in real data; domain-specific scaling exponents may differ from those of language models; and the laws do not account for inference cost, interpretability, or other downstream considerations.

The discovery of neural scaling laws is among the most important recent insights in deep learning, bridging empirical machine learning practice with theoretical understanding and enabling principled resource allocation for training increasingly capable models.
