arXiv:2606.20008: VIMPO — Critic-Free Reinforcement Learning Beats GRPO on MATH-500 and AIME
VIMPO is a new reinforcement learning method for LLM reasoning that derives an implicit value function from KL-regularized RL — without a separate critic network. It outperforms GRPO on four mathematical benchmarks including AIME 2024 and AIME 2025, with gains that remain stable even under noisy reward conditions.
This article was generated using artificial intelligence from primary sources.
What Is VIMPO and Why Does It Matter
VIMPO (Value-Implicit Policy Optimization) is a reinforcement learning (RL) method for training large language models (LLMs) on reasoning tasks. It was developed by researchers at UC Berkeley (Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song, Xuandong Zhao) and published on June 19, 2026.
The motivating problem: the popular method GRPO suffers from weak credit assignment. It scores a response as a whole, so it does not distinguish well which step in a reasoning chain contributed to the correct answer. The standard solution is to add a separate critic network, but this complicates training and increases costs.
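To make the credit assignment problem concrete, here is a minimal sketch of group-relative advantages in the style of GRPO. It is an illustration under stated assumptions, not the authors' code: the function names, the binary rewards, and the use of the population standard deviation are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standardize each response's scalar reward within its sampling group.

    Assumption: one scalar outcome reward per response, normalized by the
    group's mean and population standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, token_counts):
    """Give every token in a response that response's single advantage."""
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Hypothetical group of four sampled answers to one prompt:
# 1.0 = correct final answer, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [5, 3, 4, 2]

adv = grpo_advantages(rewards)
per_token = broadcast_to_tokens(adv, token_counts)

for response_adv in per_token:
    print(response_adv)
```

Every token inside a response receives an identical value, so a decisive step and an irrelevant one are reinforced equally. That uniform signal is what the article means by weak credit assignment.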
How It Works: The Implicit Value Function
VIMPO does not train a critic network. Instead, it derives a value function mathematically from the optimality conditions of KL-regularized RL. This function is implicit in the policy itself, so it provides a credit assignment signal without any additional component.
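For readers who want the underlying math, the block below states the textbook optimality conditions of KL-regularized RL. It is background rather than the paper's exact formulation: the symbols (a regularization strength \beta, a reference policy \pi_ref, and optimal value functions Q^* and V^*) are the usual conventions and are assumptions here, since the article does not give VIMPO's equations.

```latex
\begin{align}
% KL-regularized objective
\max_{\pi}\;& \mathbb{E}_{\pi}\Big[\sum_t r(s_t,a_t)\Big]
  - \beta\,\mathbb{E}_{\pi}\Big[\sum_t \mathrm{KL}\big(\pi(\cdot\mid s_t)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid s_t)\big)\Big] \\
% Optimal policy
\pi^*(a\mid s) &= \pi_{\mathrm{ref}}(a\mid s)\,
  \exp\!\Big(\tfrac{Q^*(s,a)-V^*(s)}{\beta}\Big) \\
% Soft value function
V^*(s) &= \beta \log \sum_{a} \pi_{\mathrm{ref}}(a\mid s)\,
  \exp\!\big(Q^*(s,a)/\beta\big) \\
% Rearranged: the log-ratio of policies encodes the advantage
\beta \log \frac{\pi^*(a\mid s)}{\pi_{\mathrm{ref}}(a\mid s)} &= Q^*(s,a) - V^*(s)
\end{align}
```

The last line is the sense in which a value signal is "contained in the policy": the per-action advantage can be read off the log-ratio between the policy and its reference, with no second network. How VIMPO turns this identity into a training loss is not described in this article.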
The result is a method that retains the practical simplicity of critic-free training (similar to GRPO) while correcting its fundamental shortcoming.
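A toy numerical check can show why no critic is needed in the KL-regularized setting. The sketch below uses a single-step problem with three actions and verifies that the policy's log-ratio against a reference policy reproduces reward minus value. All names and numbers are hypothetical, and this is the standard identity, not VIMPO's actual estimator.

```python
import math


def soft_value(rewards, ref_probs, beta):
    """V = beta * log( sum_a pi_ref(a) * exp(r(a) / beta) )."""
    return beta * math.log(
        sum(p * math.exp(r / beta) for r, p in zip(rewards, ref_probs))
    )


def optimal_policy(rewards, ref_probs, beta):
    """pi*(a) = pi_ref(a) * exp((r(a) - V) / beta)."""
    v = soft_value(rewards, ref_probs, beta)
    return [p * math.exp((r - v) / beta) for r, p in zip(rewards, ref_probs)]


def implicit_advantage(policy_probs, ref_probs, beta):
    """beta * log(pi / pi_ref), computed from the two policies alone."""
    return [beta * math.log(p / q) for p, q in zip(policy_probs, ref_probs)]


# Hypothetical one-step problem with three actions.
rewards = [1.0, 0.0, 0.5]
ref_probs = [0.2, 0.5, 0.3]
beta = 0.5

v = soft_value(rewards, ref_probs, beta)
pi_star = optimal_policy(rewards, ref_probs, beta)
implied = implicit_advantage(pi_star, ref_probs, beta)

for r, a in zip(rewards, implied):
    # The two printed numbers agree up to floating-point error.
    print(round(r - v, 6), round(a, 6))
```

The advantage here comes entirely from log-probabilities that a policy already produces, which is consistent with the article's claim that the method keeps the simplicity of critic-free training.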
Results: Consistently Better Than GRPO
VIMPO outperforms GRPO on all four tested benchmarks:
- MATH-500: standard mathematical benchmark
- AIME 2024: challenging competition mathematics
- AIME 2025: challenging competition mathematics
- OlympiadBench: olympiad-level problems
The improvements are consistent and remain stable even with noisy reward signals — an important property in real-world applications where automated grading is imperfect.
Significance for Reasoning Model Development
VIMPO offers a practical path to better RL training for reasoning models without the architectural complexity of dual-network systems. The method is particularly relevant for research groups working with limited computational resources, as it eliminates the need for parallel training of a critic component.
Frequently Asked Questions
- How does VIMPO differ from GRPO?
- GRPO suffers from a credit assignment problem because it treats all steps in a reasoning chain equally. VIMPO addresses this with an implicit value function derived directly from the optimality conditions of KL-regularized RL, without training a separate critic network.
- On which benchmarks was VIMPO tested?
- On four mathematical benchmarks: MATH-500, AIME 2024, AIME 2025, and OlympiadBench. It shows consistently better results than GRPO on all of them, including scenarios with noisy reward signals.
- Who is behind VIMPO?
- The authors are Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song, and Xuandong Zhao from UC Berkeley. The paper was submitted on June 18, 2026, and published on June 19, 2026.