Two AI Metrics Diverged — Will It Make All the Difference?

Editorial illustration: Metric choice determines conclusions about AI democratization versus concentration

Researchers from MIT and Northwestern show that whether AI capabilities appear to be democratizing or concentrating depends on which benchmark you use, not only on the actual state of the technology.

This article was generated using artificial intelligence from primary sources.

Will the most powerful AI systems remain reserved for wealthy corporations and governments, or will they eventually become accessible to everyone? That is one of the fundamental policy questions of modern AI development, and according to new research it has no single answer: it depends on what you measure.

The paper “Two AI Metrics Diverged: Will it Make All the Difference?” by Alex Fogelison, Zachary Brown, Hans Gundlach, Jayson Lynch, and Neil Thompson from MIT and Northwestern University, accepted at the ICML 2026 Technical AI Governance Research Workshop, offers a mathematical analysis with far-reaching implications for regulators, researchers, and anyone trying to predict the future of AI.

Can the same technology simultaneously democratize and concentrate capabilities?

The researchers’ answer is unambiguous: it can, and that is exactly what is happening — depending on which benchmark you look at.

Validation loss, the standard measure of model error used daily in AI research, shows convergence between smaller and larger models as compute resources grow. Smaller models catch up with larger ones. This is a signal that favors the democratization narrative — the argument that advanced AI will become increasingly accessible to a broader range of actors.

However, other capability measures, such as benchmarks that test concrete tasks like programming, reasoning, or persuasive writing, show divergence. Frontier models, developed in large laboratories with billions of dollars of compute, are not merely staying ahead of smaller models but steadily extending their lead.

Both findings hold simultaneously. The paradox is not accidental — it follows from the mathematical structure of the metrics themselves.

A taxonomy of metrics: bounded versus unbounded

The central contribution of the paper is a formal mathematical taxonomy of measurement tools by their functional form relative to compute cost.

The authors prove that bounded metrics, those with a hard mathematical limit on the best achievable value, consistently favor accessibility. When large models approach that limit, smaller models can follow with dramatically fewer resources. Validation loss is such a measure: it cannot fall below a floor, so the room left for frontier models to pull ahead keeps shrinking.

On the other hand, unbounded metrics — those that can grow without an upper limit — favor concentration among actors with vast resources. While one model scores 100 on a given benchmark, another with more resources can score 1,000 or 10,000. The gap does not narrow; it grows.
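
The contrast between bounded and unbounded metrics can be made concrete with a small numerical sketch. Everything in it is an assumption chosen for illustration rather than something taken from the paper: the function names `bounded_loss`, `unbounded_score`, and `gaps`, the power-law forms, and all constants are hypothetical. It compares a frontier actor with one that has 100 times less compute, at three levels of overall compute.

```python
# Illustrative only: both functional forms and every constant below are
# assumptions for this sketch, not values or formulas from the paper.

def bounded_loss(compute: float, floor: float = 1.7,
                 scale: float = 10.0, exponent: float = 0.3) -> float:
    """Loss that decays as a power law toward a fixed floor (bounded metric)."""
    return floor + scale * compute ** -exponent


def unbounded_score(compute: float, scale: float = 1.0,
                    exponent: float = 0.5) -> float:
    """Score that grows as a power law with no ceiling (unbounded metric)."""
    return scale * compute ** exponent


def gaps(frontier_compute: float, ratio: float = 100.0) -> tuple[float, float]:
    """Return (loss gap, score gap) between a frontier actor and an actor
    that has `ratio` times less compute."""
    small_compute = frontier_compute / ratio
    loss_gap = bounded_loss(small_compute) - bounded_loss(frontier_compute)
    score_gap = unbounded_score(frontier_compute) - unbounded_score(small_compute)
    return loss_gap, score_gap


if __name__ == "__main__":
    for compute in (1e4, 1e6, 1e8):
        loss_gap, score_gap = gaps(compute)
        print(f"frontier compute {compute:.0e}: "
              f"loss gap {loss_gap:.4f}, score gap {score_gap:.1f}")
```

Under these assumed forms, the loss gap shrinks as overall compute rises while the score gap keeps growing, even though the smaller actor is always exactly 100 times behind in compute.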

This is not a purely theoretical curiosity. The choice of benchmark in evaluation reports, regulatory proposals, and public studies directly determines which conclusion you reach — even when looking at the same models on the same tasks.

Policy implications: the debate is partly an artifact of measurement

The researchers specifically highlight domains such as software engineering, synthetic biology, and rhetorical persuasiveness as examples where the same frontier model progress can look like democratization or concentration — depending on whether the relevant capability in that domain is mathematically bounded or not.

This has direct implications for regulators drafting policy based on “whether AI capability is accessible to small actors.” Use a bounded metric and you will conclude that it is. Use an unbounded metric and you will conclude the opposite.
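
A toy example shows how the same models can support both readings. It assumes a single hypothetical latent "skill" value per model that grows with compute, and reports it two ways: through a saturating score capped at 1.0 and as the raw value. The names `latent_skill`, `bounded_metric`, `unbounded_metric`, and `verdict`, along with the exponent and the compute budgets, are invented for this sketch and do not come from the paper.

```python
# Hypothetical setup: one latent skill per model, reported through two metrics.
# The growth exponent and both metric definitions are assumptions.

def latent_skill(compute: float) -> float:
    """Assumed underlying capability, growing as a power law in compute."""
    return compute ** 0.3


def bounded_metric(skill: float) -> float:
    """Saturating score in [0, 1): approaches 1.0 as skill grows."""
    return skill / (1.0 + skill)


def unbounded_metric(skill: float) -> float:
    """Raw skill value, with no ceiling."""
    return skill


def verdict(metric, small_compute: float, frontier_compute: float,
            growth: float = 100.0) -> str:
    """Compare the frontier-vs-small gap today with the gap after both
    compute budgets are multiplied by `growth`."""
    def gap(multiplier: float) -> float:
        frontier = metric(latent_skill(frontier_compute * multiplier))
        small = metric(latent_skill(small_compute * multiplier))
        return frontier - small

    return "gap narrows" if gap(growth) < gap(1.0) else "gap widens"


if __name__ == "__main__":
    print("bounded metric:  ", verdict(bounded_metric, 1e2, 1e4))
    print("unbounded metric:", verdict(unbounded_metric, 1e2, 1e4))
```

With these assumed budgets the bounded metric reports a narrowing gap and the unbounded metric a widening one, although the underlying models are identical in both cases.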

Debates about AI democratization versus concentration are partly an artifact of the measurement instrument, not a reflection of the actual state of the technology.

The paper calls on the research community to explicitly identify the functional form of the metrics used when drawing policy conclusions — and to be aware that a benchmark suitable for comparing models within a laboratory may not be suitable for predicting the social outcomes of AI development.

For researchers and policy makers tracking AI regulation, this is an argument that no single benchmark should be used as the sole indicator when making decisions about accessibility or concentration of capabilities — because behind every such conclusion lies a mathematical assumption that may be entirely at odds with intuition.

Frequently Asked Questions

What is validation loss and why does it matter for this debate?
Validation loss is a standard measure of model error, computed on held-out data that the model was not trained on. According to this paper, validation loss shows convergence between smaller and larger models, which would suggest democratization of AI capabilities.

Why do bounded metrics favor accessibility?
Bounded metrics have a hard limit on achievable values. When large models approach that limit, smaller models can follow with far fewer resources, a mathematical condition the authors formally prove.

Which domains can have opposing conclusions depending on metric choice?
The authors cite software engineering, synthetic biology, and rhetorical persuasiveness as examples where the choice of bounded versus unbounded metric can lead to completely opposite policy conclusions.