arXiv:2605.15177 OpenDeepThink: parallel reasoning via Bradley-Terry aggregation lifts Gemini 3.1 Pro by +405 Elo on Codeforces
OpenDeepThink is a new population-based test-time compute scaling methodology published May 14, 2026 on arXiv by Shang Zhou and collaborators. The framework samples multiple reasoning candidates in parallel and selects the best through pairwise Bradley-Terry comparisons, instead of pointwise LLM judging. Result: Gemini 3.1 Pro gains +405 Elo on Codeforces benchmarks across eight sequential LLM-call rounds (~27 minutes). The team also released the CF-73 dataset with 73 expert-rated Codeforces problems.
This article was generated using artificial intelligence from primary sources.
Shang Zhou, Wenhao Chai, Kaiyuan Liu, Huanzhi Mao, Qiuyang Mang, and Jingbo Shang published a paper on May 14, 2026 that addresses one of the most familiar problems in parallel reasoning scaling: how to reliably select the best answer among parallel candidates without a ground-truth verifier.
What is the selection bottleneck in parallel reasoning?
Test-time compute scaling increasingly relies on parallel sampling: the model generates N candidates and the system selects the best. The bottleneck is selection. Without a ground-truth verifier, pointwise LLM judging, in which each candidate is scored on its own, is “noisy and biased” because the model is not reliable at evaluating its own output. OpenDeepThink instead has the LLM compare candidates in pairs and aggregates the votes with a Bradley-Terry model.
How does the Bradley-Terry generational loop work?
The system operates generationally across eight rounds, each following the same steps:
- Random pairing — LLM judges random pairs of candidates
- Bradley-Terry aggregation — votes are transformed into a global ranking using the statistical Bradley-Terry model
- Selection — top-ranked candidates are retained
- Mutation — the top three-quarters are modified through natural-language critique derived from comparisons
- Discard — the bottom quarter is eliminated
- Loop repeats across 8 sequential rounds (~27 minutes)
The approach is inspired by evolutionary algorithms — a population persists across generations, but instead of a biological fitness function it uses LLM-based pairwise preference learning.
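The aggregation and selection steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `fit_bradley_terry` and `split_population`, the `prior` smoothing term, and the toy vote list are all hypothetical. The fit uses the standard minorization-maximization update for the Bradley-Terry model, with a virtual opponent of strength 1 so that a candidate with zero wins still gets a finite score.

```python
from collections import defaultdict


def fit_bradley_terry(n_items, outcomes, iters=200, prior=0.1):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    `prior` is an illustrative smoothing assumption: each candidate gets
    `prior` wins and `prior` losses against a virtual opponent of strength 1.
    """
    wins = [0.0] * n_items
    games = defaultdict(int)  # unordered pair (i, j), i < j -> comparison count
    for winner, loser in outcomes:
        wins[winner] += 1.0
        games[(min(winner, loser), max(winner, loser))] += 1

    strengths = [1.0] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), count in games.items():
            share = count / (strengths[i] + strengths[j])
            denom[i] += share
            denom[j] += share
        # MM update: (smoothed wins) / (expected comparisons per unit strength)
        strengths = [
            (wins[i] + prior) / (denom[i] + 2.0 * prior / (strengths[i] + 1.0))
            for i in range(n_items)
        ]
    return strengths


def split_population(strengths, keep_fraction=0.75):
    """Return (kept, discarded) candidate indices, strongest first."""
    order = sorted(range(len(strengths)), key=lambda i: strengths[i], reverse=True)
    n_keep = max(1, round(len(order) * keep_fraction))
    return order[:n_keep], order[n_keep:]


# Toy votes (hypothetical): each tuple is (winner, loser) from one pairwise judgment.
outcomes = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
strengths = fit_bradley_terry(4, outcomes)
kept, discarded = split_population(strengths)
print(kept, discarded)
```

In the loop described above, the kept candidates would then be rewritten using critiques from the comparisons, and the next round would start from the new population. That mutation step needs an LLM call and is left out of the sketch.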
What numbers does the paper concretely demonstrate?
The most important metric: on Codeforces benchmarks, OpenDeepThink raised Gemini 3.1 Pro’s effective Elo rating by +405 points across 8 sequential LLM-call rounds (~27 minutes). A gain of +405 Elo is a large shift: it moves an already grandmaster-level Gemini into the range of the world’s top human competitors.
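To put the size of that gap in perspective, the standard Elo formula converts a rating difference into an expected score. The snippet below is a rough illustration only: it assumes the conventional logistic curve with a scale of 400, and the paper may compute its "effective Elo" differently. The function name `elo_expected_score` is hypothetical.

```python
def elo_expected_score(rating_gap, scale=400.0):
    """Expected score of the higher-rated side under the standard Elo formula."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# A 405-point gap under the conventional 400-point scale (an assumption).
expected = elo_expected_score(405)
print(f"{expected:.3f}")  # 0.911
```

Under these assumptions, a competitor rated 405 points higher would be expected to score about 91% against the lower-rated one.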
On the multi-domain HLE benchmark, gains are concentrated in objectively verifiable domains (math, programming), while the trend reverses in subjective domains (creative writing, opinions). This suggests that Bradley-Terry aggregation helps only where there is a clear signal of which answer is better.
What does the CF-73 dataset contribute?
The team released CF-73 — a curated dataset of 73 expert-rated Codeforces problems with Grandmaster annotations. CF-73 serves as a public evaluation resource for future reasoning research and helps standardize measurement protocols in a domain where benchmarks quickly become outdated.
The framework transfers across model variants without retuning, making it a “model-agnostic” addition to any frontier reasoning system. The approach competes directly with SU-01 (arXiv:2605.13301, May 13) and its gold-medal Olympiad reasoning, but from a different direction: SU-01 trains a specialized model, whereas OpenDeepThink uses a general-purpose LLM with a smarter inference loop.
Frequently Asked Questions
- What is Bradley-Terry aggregation in the context of parallel reasoning?
- Bradley-Terry is a statistical model for pairwise comparisons; OpenDeepThink uses it instead of pointwise LLM judging — the LLM judges pairs of candidates, votes are aggregated into a global ranking, and top candidates are retained and mutated through natural-language critique.
- What is the CF-73 dataset?
- CF-73 is a curated dataset of 73 expert-rated Codeforces problems with Grandmaster annotations, published by the OpenDeepThink team as a public evaluation resource for future reasoning research.