Neural scaling laws describe how neural network performance improves predictably with three factors: model size (parameters), training data size (samples), and compute budget (FLOPs). Empirically, performance follows power-law relationships: loss scales approximately as N^{-alpha}, where N is the factor being scaled and alpha is small, typically around 0.05-0.1 for language models. These laws are striking because they hold across diverse architectures (transformers, CNNs, RNNs), domains (vision, language, multimodal), and scales (millions to billions of parameters). Scaling laws enable predicting performance before training, allocating compute efficiently between model size and data, and understanding fundamental limits of deep learning.
Neural scaling laws, documented most prominently by Kaplan et al. (2020) at OpenAI and Hoffmann et al. (2022) at DeepMind, along with subsequent work, show that deep learning performance is not haphazard but follows predictable mathematical relationships. The primary finding is that loss decreases as a power law in three factors: model size (N), data size (D), and compute budget (C).
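The scaling laws are typically expressed as:
L(N) ∝ N^{-alpha_N},   L(D) ∝ D^{-alpha_D},   L(C) ∝ C^{-alpha_C}
where L is the test loss and each relation holds when the other two factors are not the bottleneck. For language model pretraining, Kaplan et al. (2020) report alpha_N ≈ 0.076, alpha_D ≈ 0.095, and alpha_C ≈ 0.050 (with compute allocated optimally). The power-law form is remarkably consistent across different architectures and domains, suggesting it reflects fundamental properties of learning from data, although the fitted exponents themselves vary by domain.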
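To make the size of these exponents concrete, the following minimal sketch computes how much a pure power law L(x) ∝ x^{-alpha} lowers the loss when x grows by a given factor. The function names are hypothetical, the exponent 0.076 is an illustrative value of the magnitude quoted above, and the sketch assumes there is no irreducible loss floor.

```python
def loss_ratio(scale_factor: float, alpha: float) -> float:
    """Return L(k * x) / L(x) for a pure power law L(x) proportional to x**(-alpha)."""
    return scale_factor ** (-alpha)


def scale_needed(target_ratio: float, alpha: float) -> float:
    """Return the factor k by which x must grow so the loss is multiplied by target_ratio."""
    return target_ratio ** (-1.0 / alpha)


# Illustrative exponent for model size (an assumption, not a fitted value).
ALPHA_N = 0.076

ratio_10x_params = loss_ratio(10.0, ALPHA_N)   # loss multiplier after 10x more parameters
factor_to_halve = scale_needed(0.5, ALPHA_N)   # growth in parameters needed to halve the loss

print(f"10x parameters multiplies the loss by {ratio_10x_params:.3f}")
print(f"Halving the loss needs about {factor_to_halve:,.0f}x more parameters")
```

Under these assumptions, a tenfold increase in parameters cuts the loss by only about 16%, and halving the loss would take roughly a 9,000-fold increase, which is why even small exponents imply very large scale-ups.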
A key result is the Chinchilla finding of Hoffmann et al. (2022): for a fixed compute budget, performance is best when model size and the number of training tokens are scaled in roughly equal proportion. This overturned the previous practice of scaling model size much more aggressively than data size. The implication: a 175B-parameter model trained on 300B tokens is undertrained for its size; at a comparable budget, a smaller model trained on far more data does better. Chinchilla itself, with ~70B parameters trained on about 1.4T tokens, outperformed the 280B-parameter Gopher, which was trained on about 300B tokens with similar compute. This principle has guided subsequent model development and explains why later models are trained on many more tokens per parameter.
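The allocation rule can be sketched numerically. The example below is a rough illustration, not the fitting procedure of Hoffmann et al.: it assumes the common approximation C ≈ 6·N·D for training FLOPs and a rule of thumb of about 20 training tokens per parameter, neither of which is stated in the text above. The function name and its parameters are hypothetical.

```python
import math


def compute_optimal_allocation(
    compute_flops: float,
    tokens_per_param: float = 20.0,      # assumed rule of thumb
    flops_per_param_token: float = 6.0,  # assumed C ~ 6 * N * D approximation
) -> tuple[float, float]:
    """Split a FLOP budget into (parameters, tokens) under the stated assumptions.

    Solves C = flops_per_param_token * N * D together with D = tokens_per_param * N.
    """
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget matching a 70B-parameter model trained on 1.4T tokens under C ~ 6 * N * D.
budget = 6.0 * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal_allocation(budget)

print(f"parameters: {n_params:.3g}, tokens: {n_tokens:.3g}")
```

For this budget the sketch returns about 70B parameters and 1.4T tokens, matching the Chinchilla configuration. Because both parameters and tokens grow as the square root of compute, quadrupling the budget doubles each.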
Theoretically, understanding scaling laws remains incomplete. Several frameworks provide partial explanations:
1. Statistical learning theory: Generalization bounds scale with model capacity and data size, consistent with power-law scaling in the overparameterized regime.
2. Renormalization group theory: Some researchers draw parallels to phase transitions and critical phenomena in physics, where observables scale as power laws near criticality.
3. Information-theoretic bounds: Bounds on mutual information between data and model parameters suggest power-law scaling of required samples.
4. Benign overfitting: In the overparameterized regime, models can achieve zero training error while generalizing, enabled by implicit regularization that suppresses memorization of noise.
However, none of these fully explains why the exponents take the specific values they do or why the power-law form is so consistent across domains. The mechanism by which neural networks extract structure from data so efficiently remains only partially understood.
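Practically, scaling laws enable several capabilities: predicting the loss of a large model from small-scale runs before committing to full training, allocating a fixed compute budget efficiently between model size and data, and estimating how much further scaling is needed to reach a target loss.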
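As an illustration of forecasting before training, a power law can be fitted to a handful of small runs and extrapolated to a larger model. The sketch below does this with ordinary least squares in log-log space. The data are synthetic, generated from a known power law purely for demonstration, and the function names are hypothetical. It assumes a pure power law with no irreducible loss term.

```python
import math


def fit_power_law(xs: list[float], losses: list[float]) -> tuple[float, float]:
    """Fit loss ~ a * x**(-alpha) by least squares on (log x, log loss).

    Returns (a, alpha).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a: float, alpha: float, x: float) -> float:
    """Evaluate the fitted power law at scale x."""
    return a * x ** (-alpha)


# Synthetic "small runs": model sizes and losses from an assumed law (not real results).
TRUE_A, TRUE_ALPHA = 20.0, 0.08
sizes = [1e6, 3e6, 1e7, 3e7, 1e8]
losses = [TRUE_A * s ** (-TRUE_ALPHA) for s in sizes]

a_hat, alpha_hat = fit_power_law(sizes, losses)
predicted_loss_1b = predict_loss(a_hat, alpha_hat, 1e9)  # extrapolate 10x past the largest run

print(f"fitted alpha: {alpha_hat:.4f}, predicted loss at 1e9 parameters: {predicted_loss_1b:.4f}")
```

With real measurements the points scatter around the line, so it is prudent to hold out the largest run as a check on the extrapolation before trusting predictions far beyond the fitted range.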
Scaling laws also have limitations. They may break down at very large scales or across regime changes such as double descent. They are empirical fits that depend on the structure of real data, not distribution-independent guarantees. Exponents fitted on language models may not carry over to other domains. Finally, they do not account for inference cost, interpretability, or other downstream considerations.
The discovery of neural scaling laws is among the most important recent insights in deep learning, bridging empirical machine learning practice with theoretical understanding and enabling principled resource allocation for training increasingly capable models.