arXiv: CoDaPO — adaptive RL optimization

A new paper identifies three recurring dynamics in RL training of reasoning models and proposes CoDaPO, a method that weights questions by confidence and difficulty. By prioritizing learnable questions, it reports consistent improvements across 12 benchmarks.

A paper posted to arXiv on 6 June 2026 (identifier arXiv:2606.07950, version v1, 02:51 UTC) presents CoDaPO, a confidence- and difficulty-adaptive policy optimization method for training reasoning models. The paper starts from an analysis of recurring problems in reinforcement learning (RL) training.

Which dynamics arise in RL training?

The authors identify three recurring dynamics in RL training. The first is confidence inflation, in which the model becomes increasingly certain of its answers regardless of their actual accuracy.

The second is advantage contraction, in which the differences in usefulness among individual examples shrink, making learning harder. The third is hierarchical convergence, a pattern in which the model converges in layers. Together, according to the paper, these three dynamics explain why standard RL training spends compute inefficiently.
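Advantage contraction can be pictured with a minimal Python sketch. It rests on an assumption the article does not state: that each rollout's advantage is its reward minus the mean reward of the rollouts for the same question. The function names and the reward lists are hypothetical and chosen only for illustration.

```python
from statistics import mean

def centered_advantages(rewards):
    """Advantage of each rollout relative to the group mean (an assumed baseline)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

def advantage_spread(rewards):
    """Mean absolute advantage: a rough proxy for how much learning signal is left."""
    return mean([abs(a) for a in centered_advantages(rewards)])

# Hypothetical 0/1 correctness rewards for eight rollouts of one question.
mixed = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0]          # solved half the time
mostly_solved = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]  # solved almost always
all_solved = [1.0] * 8                                     # solved every time

for name, rewards in [("mixed", mixed),
                      ("mostly_solved", mostly_solved),
                      ("all_solved", all_solved)]:
    print(name, advantage_spread(rewards))
```

Under this assumption, the spread shrinks as the rollouts for a question agree more often and vanishes when they all receive the same reward, so such a question contributes no gradient signal even though it still consumes rollouts.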

What is CoDaPO and how does it work?

In response to these problems, the paper proposes CoDaPO. The method assigns importance to questions based on rollout confidence (confidence during answer generation) and the empirical difficulty of each question.

Based on that estimate, CoDaPO reweights the policy updates. This steers training toward the examples that contribute most to learning instead of treating all examples equally.
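The article does not give the paper's weighting formula, so the following Python sketch only illustrates the general idea of weighting questions by confidence and difficulty and then reweighting the update. The `QuestionStats` container, the `question_weight` formula, the `floor` parameter, and all numbers are assumptions made for this example, not the authors' method.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QuestionStats:
    """Per-question statistics from a group of rollouts (hypothetical container)."""
    mean_confidence: float  # average rollout confidence, in [0, 1]
    pass_rate: float        # fraction of rollouts judged correct, in [0, 1]
    loss: float             # unweighted policy loss for this question's rollouts

def question_weight(stats: QuestionStats, floor: float = 0.05) -> float:
    """Illustrative weight, NOT the paper's formula.

    The difficulty term 4*p*(1-p) is largest for questions solved about half
    the time and zero for questions that are always or never solved. The
    confidence term shrinks the weight when the model is already very sure.
    """
    p = stats.pass_rate
    difficulty_term = 4.0 * p * (1.0 - p)
    confidence_term = 1.0 - stats.mean_confidence
    # The floor keeps every question from being dropped entirely.
    return max(floor, difficulty_term * confidence_term)

def reweighted_loss(batch: List[QuestionStats]) -> float:
    """Weighted average of per-question losses, weights normalized to sum to 1."""
    weights = [question_weight(q) for q in batch]
    total = sum(weights)
    return sum(w * q.loss for w, q in zip(weights, batch)) / total

# Hypothetical batch: one easy, one medium, and one unsolved question.
easy = QuestionStats(mean_confidence=0.95, pass_rate=1.0, loss=0.2)
medium = QuestionStats(mean_confidence=0.5, pass_rate=0.5, loss=1.0)
unsolved = QuestionStats(mean_confidence=0.9, pass_rate=0.0, loss=2.0)

batch = [easy, medium, unsolved]
print([question_weight(q) for q in batch])
print(reweighted_loss(batch))
```

In this sketch the medium question dominates the update, while the always-solved and never-solved questions fall back to the floor weight. A real implementation would apply such weights to per-sample policy-gradient terms inside the training loop rather than to scalar losses.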

Why are learnable questions the focus?

The key idea is to prioritize “learnable” questions within a fixed compute budget. These are questions that are neither too easy nor unsolvable: the ones from which the model can learn the most.

By distinguishing questions the model already solves confidently from those that remain challenging, CoDaPO avoids wasting resources on examples that bring no progress. The same compute budget is thus spent more purposefully.

How much improvement does the method bring?

According to the paper, CoDaPO achieves consistent improvements across 12 benchmarks over existing RL methods. This is steady progress across a broad set of tasks, not an isolated result on a single test.

The goal of the method is a more efficient allocation of compute, achieved by distinguishing questions the model already solves from those that remain hard. This targeted allocation of resources is what lies behind the reported improvements.

Why is this approach significant?

The paper is interesting because it frames the training of reasoning models as a question of where to allocate attention rather than how many resources to add. Instead of increasing compute, CoDaPO directs it more intelligently.

This opens the way to more efficient training of models under limited budgets. For researchers working with fixed resources, such an adaptive approach can mean a better result at no extra cost.

Frequently Asked Questions

Which three RL dynamics does the paper identify?

The paper identifies three recurring dynamics in RL training: confidence inflation, advantage contraction, and hierarchical convergence. All three are patterns that arise during the training of reasoning models.

How does CoDaPO assign importance to questions?

CoDaPO assigns importance to questions based on rollout confidence (confidence during generation) and empirical difficulty, and then reweights the policy updates. The goal is to prioritize learnable questions within a fixed compute budget.

How much improvement does the method bring?

CoDaPO achieves consistent improvements across 12 benchmarks over existing RL methods. The improvements come from a more efficient allocation of compute that distinguishes questions the model already solves from those that remain challenging.
