arXiv OpenDeepThink: +405 Elo on Codeforces

OpenDeepThink is a new population-based test-time compute scaling methodology published May 14, 2026 on arXiv by Shang Zhou and collaborators. The framework samples multiple reasoning candidates in parallel and selects the best through pairwise Bradley-Terry comparisons, instead of pointwise LLM judging. Result: Gemini 3.1 Pro gains +405 Elo on Codeforces benchmarks across eight sequential LLM-call rounds (~27 minutes). The team also released the CF-73 dataset with 73 expert-rated Codeforces problems.

Shang Zhou, Wenhao Chai, Kaiyuan Liu, Huanzhi Mao, Qiuyang Mang, and Jingbo Shang published a paper on May 14, 2026 that addresses one of the most familiar problems in parallel reasoning scaling: how to reliably select the best answer among parallel candidates without a ground-truth verifier.

What is the selection bottleneck in parallel reasoning?

Test-time compute scaling increasingly relies on parallel sampling: the model generates N candidates and the system selects the best. The bottleneck is the selection step. Without a ground-truth verifier, pointwise LLM judging is “noisy and biased”, because the model is not reliable at scoring its own output in isolation. OpenDeepThink's answer is to replace pointwise scores with pairwise comparisons, aggregated through the Bradley-Terry model.
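For reference, the textbook form of the Bradley-Terry model is given below. The article does not state the paper's exact parameterization, so the symbols here (strengths \pi_i, comparison set \mathcal{C}) are the standard ones and should be read as an assumption rather than the authors' notation. Each candidate i gets a positive strength, the probability that i is preferred over j depends only on the two strengths, and the global ranking comes from the strengths that maximize the likelihood of the observed votes.

```latex
\[
P(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j}, \qquad \pi_i > 0
\]
\[
\hat{\pi} = \arg\max_{\pi} \sum_{(w,\,\ell) \in \mathcal{C}} \log \frac{\pi_w}{\pi_w + \pi_\ell}
\]
```

Here each pair (w, \ell) in \mathcal{C} records one judged comparison with winner w and loser \ell. Sorting candidates by \hat{\pi} yields the ranking, which is why noisy individual votes can still produce a stable ordering once enough pairs are compared.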

How does the Bradley-Terry generational loop work?

The system works in generations. Each round runs the following steps, and the loop repeats for eight rounds:

Random pairing — LLM judges random pairs of candidates
Bradley-Terry aggregation — votes are transformed into a global ranking using the statistical Bradley-Terry model
Selection — top-ranked candidates are retained
Mutation — the top three-quarters are modified through natural-language critique derived from comparisons
Discard — the bottom quarter is eliminated
Loop repeats across 8 sequential rounds (~27 minutes)
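The steps above can be sketched for a single generation as follows. This is a minimal illustration, not the paper's implementation: `judge` and `mutate` are hypothetical callables standing in for the LLM comparison and the critique-driven rewrite, the Bradley-Terry fit uses the standard minorization-maximization update with a smoothing term added here for numerical stability, and the article does not say how the population is refilled after the discard step, so this sketch simply returns the survivors.

```python
import random
from collections import defaultdict


def fit_bradley_terry(n_items, comparisons, iters=100, smoothing=0.5):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs.

    Uses the classic MM update. `smoothing` (an assumption of this sketch)
    gives both sides of every compared pair a fractional virtual win so
    that no strength collapses to zero.
    """
    wins = [0.0] * n_items
    games = defaultdict(float)  # (i, j) with i < j -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1.0
        games[(min(winner, loser), max(winner, loser))] += 1.0
    for i, j in list(games):
        games[(i, j)] += 2 * smoothing
        wins[i] += smoothing
        wins[j] += smoothing

    strengths = [1.0 / n_items] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n in games.items():
            d = n / (strengths[i] + strengths[j])
            denom[i] += d
            denom[j] += d
        new = [
            wins[i] / denom[i] if denom[i] > 0 else strengths[i]
            for i in range(n_items)
        ]
        total = sum(new)
        strengths = [s / total for s in new]  # normalize to sum to 1
    return strengths


def run_generation(population, judge, mutate, rng, pairing_rounds=4):
    """One generation: random pairing, BT ranking, keep and mutate top 3/4."""
    n = len(population)
    comparisons = []
    for _ in range(pairing_rounds):
        order = list(range(n))
        rng.shuffle(order)  # random pairing of candidates
        for a, b in zip(order[0::2], order[1::2]):
            if judge(population[a], population[b]):
                comparisons.append((a, b))
            else:
                comparisons.append((b, a))
    strengths = fit_bradley_terry(n, comparisons)
    ranked = sorted(range(n), key=lambda i: strengths[i], reverse=True)
    keep = ranked[: (3 * n) // 4]  # the bottom quarter is discarded
    return [mutate(population[i]) for i in keep]


# Toy demo: candidates are numbers, and the "judge" prefers the larger
# one but is wrong 20% of the time, mimicking a noisy LLM comparison.
rng = random.Random(0)
population = [float(i) for i in range(16)]


def noisy_judge(a, b):
    correct = a > b
    return correct if rng.random() > 0.2 else not correct


survivors = run_generation(population, noisy_judge, lambda c: c, rng)
print(len(survivors), "candidates survive the round")
```

In a real system the judge would be an LLM call that also returns a critique, and `mutate` would feed that critique back into the model. The point of the sketch is only the data flow: individual votes are noisy, but the ranking is computed from all votes at once.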

The approach is inspired by evolutionary algorithms — a population persists across generations, but instead of a biological fitness function it uses LLM-based pairwise preference learning.

What numbers does the paper concretely demonstrate?

The most important metric: on Codeforces benchmarks, OpenDeepThink raised Gemini 3.1 Pro’s effective Elo rating by +405 points across 8 sequential LLM-call rounds (~27 minutes). +405 Elo is a dramatic shift: it moves a model that was already at grandmaster level into the range of the world’s top human competitors.
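To put a 405-point gap in perspective, the short calculation below applies the standard logistic Elo formula with a scale of 400. That formula is an assumption here: the article does not say how the paper computes effective Elo, and Codeforces ratings are Elo-like rather than identical to classical Elo. The function name is illustrative.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score of the higher-rated side under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


gap = 405  # rating gain reported in the article
p = elo_expected_score(gap)
print(f"Expected score at +{gap} Elo: {p:.3f}")  # prints 0.911
```

Under that assumption, the boosted system would be expected to score about 91% in head-to-head comparisons against the unboosted baseline.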

On the multi-domain HLE benchmark, gains are concentrated in objectively verifiable domains (math, programming), while the trend reversed in subjective domains (creative writing, opinions). This suggests that Bradley-Terry aggregation only helps where there is a clear signal for which answer is better.

What does the CF-73 dataset contribute?

The team released CF-73 — a curated dataset of 73 expert-rated Codeforces problems with Grandmaster annotations. CF-73 serves as a public evaluation resource for future reasoning research and helps standardize measurement protocols in a domain where benchmarks quickly become outdated.

The framework transfers across model variants without retuning, making it a “model-agnostic” addition to any frontier reasoning system. The approach competes directly with SU-01's gold-medal Olympiad reasoning (arXiv:2605.13301, May 13), but from a different direction: SU-01 trains a specialized model, whereas OpenDeepThink uses a general-purpose LLM with a smarter inference loop.

Frequently Asked Questions

What is Bradley-Terry aggregation in the context of parallel reasoning?

Bradley-Terry is a statistical model for pairwise comparisons. OpenDeepThink uses it in place of pointwise LLM judging: the LLM judges pairs of candidates, the votes are aggregated into a global ranking, and the top candidates are retained and mutated through natural-language critique.

What is the CF-73 dataset?

CF-73 is a curated dataset of 73 expert-rated Codeforces problems with Grandmaster annotations, published by the OpenDeepThink team as a public evaluation resource for future reasoning research.

arXiv:2605.15177 OpenDeepThink: parallel reasoning via Bradley-Terry aggregation lifts Gemini 3.1 Pro by +405 Elo on Codeforces
