arXiv:2605.06638: ScaleLogic — RL power law in reasoning depth

ScaleLogic is a synthetic framework showing that the reinforcement learning compute T required for long-horizon reasoning follows a power law in reasoning depth D: T ∝ D^γ (R² > 0.99). The exponent γ ranges from 1.04 to 2.60 depending on logical expressiveness, and training on more expressive logics improves downstream results by up to 10.66 points.

Tianle Wang, Zhaoyang Wang, Guangchen Lan and colleagues published the ScaleLogic study on arXiv on 7 May. It presents a synthetic framework that systematically reveals how reinforcement learning shapes long-horizon reasoning in large language models.

How does ScaleLogic control the experiment?

ScaleLogic is a logical reasoning task generator that allows independent control of two axes: reasoning depth (the number of steps in a proof) and logical expressiveness (ranging from simple implication through propositional logic to first-order logic with conjunction, disjunction, negation, and quantifiers). This is a rarity: most benchmarks vary both variables simultaneously, so the effect of each cannot be separated from the other.

By controlling the axes independently, the authors isolate the influence of each on the required amount of RL training.
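The idea of two independent knobs can be illustrated with a toy generator. This is only a sketch under stated assumptions, not the ScaleLogic generator itself: the names `make_chain_task`, `proof_depth`, and `use_conjunction` are hypothetical, and the "expressiveness" knob here merely switches between plain implications and implications with a conjunctive premise.

```python
import random


def make_chain_task(depth, use_conjunction=False, seed=0):
    """Build a toy proof task whose goal needs exactly `depth` inference steps.

    depth           -- number of chained rules (the reasoning-depth knob)
    use_conjunction -- if True, each rule needs an extra premise q_i
                       (a crude stand-in for the expressiveness knob)
    """
    rng = random.Random(seed)
    facts = ["p0"]
    rules = []
    for i in range(depth):
        premises = [f"p{i}"]
        if use_conjunction:
            side_fact = f"q{i}"
            facts.append(side_fact)       # side premise is given as a fact
            premises.append(side_fact)
        rules.append((premises, f"p{i + 1}"))
    rng.shuffle(rules)                    # hide the proof order from the solver
    return {"facts": facts, "rules": rules, "goal": f"p{depth}"}


def proof_depth(task):
    """Forward-chain and return how many rounds are needed to reach the goal."""
    known = set(task["facts"])
    rounds = 0
    while task["goal"] not in known:
        derived = {
            conclusion
            for premises, conclusion in task["rules"]
            if conclusion not in known and all(p in known for p in premises)
        }
        if not derived:
            return None                   # goal is unreachable
        known |= derived
        rounds += 1
    return rounds


simple = make_chain_task(depth=5)
richer = make_chain_task(depth=5, use_conjunction=True)
print(proof_depth(simple), proof_depth(richer))
```

Changing `use_conjunction` alters the form of every rule while leaving the proof depth unchanged, which is the kind of separation the article describes.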

What is the main quantitative finding?

Training compute follows a power law in reasoning depth:

T ∝ D^γ, where R² > 0.99

The exponent γ grows monotonically with logical expressiveness, from 1.04 for the simplest systems to 2.60 for first-order logic. In other words, doubling the depth roughly doubles the required RL compute in the simplest setting (2^1.04 ≈ 2.06) but multiplies it by about 6 in the most expressive one (2^2.60 ≈ 6.06). The relationship is predictable and replicates across different RL methods.
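An exponent of this kind is typically estimated by a straight-line fit in log-log space, since T = c · D^γ implies log T = log c + γ · log D. The sketch below shows that standard procedure with an ordinary least-squares fit; the function name `fit_power_law` is hypothetical, the data points are synthetic placeholders rather than measurements from the paper, and the paper's own fitting procedure may differ.

```python
import math


def fit_power_law(depths, compute):
    """Least-squares fit of log T = log c + gamma * log D.

    Returns (gamma, c, r_squared), with R^2 computed in log-log space.
    """
    xs = [math.log(d) for d in depths]
    ys = [math.log(t) for t in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx                      # slope = power-law exponent
    log_c = mean_y - gamma * mean_x        # intercept = log of the prefactor
    ss_res = sum((y - (log_c + gamma * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return gamma, math.exp(log_c), 1.0 - ss_res / ss_tot


# Synthetic, noise-free placeholder data generated from T = 3 * D^1.5.
depths = [2, 4, 8, 16, 32]
compute = [3.0 * d ** 1.5 for d in depths]

gamma, c, r2 = fit_power_law(depths, compute)
print(f"gamma={gamma:.3f}  c={c:.3f}  R^2={r2:.4f}")
```

On real measurements the points would scatter around the line, and R² would indicate how well a single exponent describes the whole depth range.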

What does this change in training practice?

The most practical finding: models trained on more expressive synthetic settings score up to 10.66 points higher on downstream benchmarks and transfer more compute-efficiently, even when the total amount of training is equal. Curriculum learning, that is, training from simple to more complex logics, further improves scaling efficiency.

The implication is clear: the quality of synthetic data for RL is a lever just as powerful as raw compute. What a model trains on shapes its reasoning ability as much as how much it trains.

Frequently Asked Questions

What is ScaleLogic?

ScaleLogic is a synthetic logical reasoning environment that allows independent control of task depth (proof horizon) and logical expressiveness (from simple implication to first-order logic with quantifiers).

What does the power law in depth mean?

T ∝ D^γ means that the required RL compute T grows as a power of task depth D. The exponent γ ranges from 1.04 (simple logic) to 2.60 (expressive logic), so deeper tasks require superlinearly more compute: only slightly so at γ = 1.04, steeply at γ = 2.60.
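Because the constant of proportionality cancels, the law gives the compute ratio between two depths directly: T(k · D) / T(D) = k^γ. A minimal sketch of that arithmetic follows; the helper name `compute_ratio` is hypothetical, and only the two exponents come from the article.

```python
def compute_ratio(depth_ratio, gamma):
    """Factor by which RL compute grows when depth is multiplied by depth_ratio."""
    return depth_ratio ** gamma


# The two extreme exponents reported in the article.
for gamma in (1.04, 2.60):
    for depth_ratio in (2, 4):
        factor = compute_ratio(depth_ratio, gamma)
        print(f"gamma={gamma:.2f}: {depth_ratio}x deeper -> {factor:.2f}x compute")
```

The gap between the two settings widens quickly with depth, because the ratio of the two factors is itself a power of the depth ratio.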

Why is logical expressiveness key?

More expressive logical settings produce models that transfer knowledge better to new tasks (up to +10.66 points) and use compute more efficiently in transfer learning. What a model trains on matters as much as how much it trains.
