Scaling laws

Empirical power-law relationships linking model size, training data, and compute to performance; the foundation for planning how large models are trained.

Scaling laws are empirical power-law relationships describing how a model’s loss falls smoothly as three quantities grow: the number of parameters, the amount of training data, and the compute invested in training. Kaplan et al. (2020) showed that the loss of large language models follows predictable curves across many orders of magnitude in these quantities.

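A typical form is L(N, D) = E + A·N^−α + B·D^−β, where N is the parameter count, D is the number of training tokens, E is the irreducible loss that remains however large the model and dataset become, and A, B, α, and β are constants fitted to experimental results. These laws let researchers estimate a large model’s quality from smaller, cheaper experiments before committing an enormous compute budget.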
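As a minimal sketch of how such a law is used, the Python below evaluates the form L(N, D) = E + A·N^−α + B·D^−β and then extrapolates from three small hypothetical runs to a larger model. The constants are illustrative placeholders, not fitted results, and the function names `scaling_loss` and `fit_power_law` are invented for this example. For simplicity the sketch assumes that E and the data term are already known, so only the parameter term is fitted; in practice all constants are fitted jointly to measured losses.

```python
import math

def scaling_loss(n_params, n_tokens, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """L(N, D) = E + A*N^-alpha + B*D^-beta with placeholder constants (assumed values)."""
    return E + A * n_params ** -alpha + B * n_tokens ** -beta

def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x^-k in log-log space; returns (c, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
             / sum((a - mean_x) ** 2 for a in lx))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Hypothetical small runs, all trained on the same number of tokens.
small_sizes = [1e7, 1e8, 1e9]
n_tokens = 1e12

# Assumption: E and the data term are known, so they can be subtracted out.
floor = 1.7 + 400.0 * n_tokens ** -0.28
reducible = [scaling_loss(n, n_tokens) - floor for n in small_sizes]

# Fit the parameter term on the small runs, then extrapolate to a larger model.
c, k = fit_power_law(small_sizes, reducible)
target_size = 1e11
predicted = floor + c * target_size ** -k
actual = scaling_loss(target_size, n_tokens)
print(f"fitted exponent: {k:.3f}")
print(f"predicted loss at N=1e11: {predicted:.4f} (law gives {actual:.4f})")
```

Because the synthetic losses here come from the law itself, the extrapolation matches almost exactly. With real measurements the fit is noisy, and the prediction is only as good as the assumption that the same power law keeps holding at larger scale.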

Hoffmann et al. (2022, “Chinchilla”) revised how a compute budget should be allocated: as the budget grows, parameters and training tokens should be scaled in equal proportion. By this measure, earlier large models had been undertrained, with too little data for their size. Scaling laws now underpin the planning of foundation models, and the 2025-2026 debate is shifting toward how far these laws extend for frontier models and toward scaling compute at inference time (test-time compute).
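The equal-proportion rule can be illustrated with a short sketch. It relies on two assumptions that are not stated in the text above: the common approximation that training compute is about C ≈ 6·N·D floating-point operations, and a fixed ratio of roughly 20 training tokens per parameter, a rule of thumb often associated with the Chinchilla result. The function name `compute_optimal_split` is hypothetical.

```python
import math

def compute_optimal_split(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a compute budget between parameters N and tokens D.

    Assumes C ~= 6 * N * D and D = tokens_per_param * N (both are assumptions),
    which gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Each step multiplies the budget by 100; N and D should each grow by 10x.
for budget in (1e21, 1e23, 1e25):
    n, d = compute_optimal_split(budget)
    print(f"C={budget:.0e} FLOPs -> N={n:.2e} parameters, D={d:.2e} tokens")
```

Under these assumptions both N and D grow as the square root of the budget, so a 100-fold increase in compute is spent as a 10-fold increase in each.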

