AgentV-RL: Tool-Augmented Verifier Beats SOTA by 25.2%

AgentV-RL is a new framework for scaling reward modelling through an agentic verifier that uses multi-turn tool-augmented deliberation. Two complementary agents — forward (from premises to conclusion) and backward (from conclusion to premises) — validate reasoning. Through RL with proactive exploration, the 4B variant outperforms state-of-the-art outcome reward models by 25.2%.

Why a new approach to reward modelling?

Reward models are the foundation of RL training for LLMs, from RLHF to newer RLVR approaches. But classical outcome reward models (ORMs) have a limitation: they evaluate only the final answer and ignore the reasoning that produced it. Process reward models (PRMs), which score each reasoning step, are better in this respect, but they are expensive to train and often overly strict.

Authors Jiazheng Zhang and colleagues, in an arXiv preprint from 17 April 2026, introduce AgentV-RL — a verifier that operates as an agent: multi-turn, tool-using, deliberating before issuing an assessment.

How does the agentic verifier work?

AgentV-RL uses two complementary agents:

Forward agent. Follows the reasoning from premises to conclusion. For each step it checks: does this follow from the previous steps? Is it justified? If a fact is used, is that fact valid? The forward agent catches errors such as logical leaps and unsubstantiated claims.

Backward agent. Works in the opposite direction, from conclusion to premises. It asks: are the premises truly necessary? Is the conclusion genuinely a consequence, or was it predetermined? The backward agent catches reverse-engineered reasoning, where the model already knows the result and fabricates a justification for it.

The two directions are not redundant — they catch different classes of errors that appear in different types of problems.
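One way to picture the two directions is a toy proof in which every step names the statements it relies on. The sketch below is an illustration only, not the paper's method: `Step`, `forward_check`, and `backward_check` are hypothetical names, and the real agents are LLM-driven rather than rule-based. It only shows why the two passes can flag different problems.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    claim: str
    uses: list = field(default_factory=list)  # statements this step relies on


def forward_check(premises, steps):
    """Walk premises -> conclusion; return claims that lack established support."""
    known = set(premises)
    problems = []
    for step in steps:
        # A step with no support, or support not yet established, is a leap.
        if not step.uses or any(u not in known for u in step.uses):
            problems.append(step.claim)
        # Keep going so that only the root cause is reported, not its descendants.
        known.add(step.claim)
    return problems


def backward_check(premises, steps):
    """Walk conclusion -> premises; report what the conclusion never needed."""
    derived = {s.claim: s.uses for s in steps}
    reached, stack = set(), [steps[-1].claim]  # assumes the last step is the conclusion
    while stack:
        claim = stack.pop()
        if claim not in reached:
            reached.add(claim)
            stack.extend(derived.get(claim, []))
    return {
        "unused_premises": set(premises) - reached,
        "unused_steps": set(derived) - reached,
        "unsupported": reached - set(premises) - set(derived),
    }


premises = {"a > 0", "b > 0", "c = 5"}
proof = [
    Step("a*b > 0", ["a > 0", "b > 0"]),
    Step("a*b + 1 > 1", ["a*b > 0"]),
]
print(forward_check(premises, proof))
print(backward_check(premises, proof)["unused_premises"])
```

In this toy case the forward pass accepts every step, while the backward pass notices that the premise `c = 5` plays no role in reaching the conclusion.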

The role of tools and proactive exploration

AgentV-RL is not merely two LLMs — the verifier has access to tools:

Code executor — for verifying mathematical calculations or programming claims
Knowledge lookup — for facts that can be checked in an external knowledge base
Symbolic solver — for logical or algebraic inferences where a deterministic answer exists

Through RL with proactive exploration, the verifier learns when to use which tool — it does not invoke all tools every time, but selects based on the type of problem. This is the key difference from passive PRMs that only read text.
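The idea of choosing a tool per claim can be sketched with a small dispatcher. Everything here is an assumption made for illustration: the claim format, the `TOOLS` registry, the `TOY_KB` dictionary, and the hand-written routing by `kind`, which stands in for the policy that AgentV-RL learns through RL. The arithmetic checker uses Python's standard `ast` module as a stand-in for a real code executor.

```python
import ast
import operator

# Only these binary operators are allowed in the toy arithmetic checker.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def eval_arithmetic(expr):
    """Evaluate an expression made of numeric literals and + - * / ** only."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)


def check_arithmetic(claim):
    try:
        return abs(eval_arithmetic(claim["expr"]) - claim["claimed"]) < 1e-9
    except (ValueError, ZeroDivisionError, SyntaxError):
        return False


# Hypothetical knowledge base with a single well-known constant.
TOY_KB = {"speed_of_light_m_per_s": 299792458}


def check_fact(claim):
    return TOY_KB.get(claim["key"]) == claim["claimed"]


# Hand-written routing table; a learned policy would replace this lookup.
TOOLS = {"arithmetic": check_arithmetic, "fact": check_fact}


def verify(claim):
    """Return True/False from a tool, or None when no tool applies."""
    tool = TOOLS.get(claim["kind"])
    return None if tool is None else tool(claim)


print(verify({"kind": "arithmetic", "expr": "17 * 23", "claimed": 391}))
print(verify({"kind": "opinion", "text": "the proof is elegant"}))
```

Returning `None` for claims no tool can check keeps the tool verdicts separate from text-only judgement, which is one plausible way to avoid calling every tool on every claim.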

What are the results?

The most impressive figure from the abstract: the 4B AgentV-RL model outperforms SOTA outcome reward models by 25.2%. This is a substantial margin in a field where improvements are typically measured in single-digit percentages.

The authors also demonstrate test-time scaling: performance improves when the verifier is given more deliberation time. This matters in practice because the deliberation budget can be matched to problem difficulty, so simple cases finish quickly while difficult ones receive more reasoning cycles.
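How a deliberation budget might adapt to difficulty can be sketched with a simple early-stopping vote. This is an assumed mechanism for illustration, not one described in the source: `deliberate`, `sample_vote`, `max_rounds`, and `margin` are hypothetical names, and each "vote" stands in for one full verification pass.

```python
def deliberate(sample_vote, max_rounds, margin):
    """Collect accept/reject votes until one side leads by `margin`
    or the budget is spent. Returns (verdict, rounds_used)."""
    lead = 0   # accept votes minus reject votes
    used = 0
    for i in range(max_rounds):
        lead += 1 if sample_vote(i) else -1
        used = i + 1
        if abs(lead) >= margin:
            break  # confident enough, stop spending compute
    # A tie or a negative lead counts as a rejection.
    return lead > 0, used


# Easy case: every pass agrees, so the loop stops at the margin.
easy = deliberate(lambda i: True, max_rounds=8, margin=3)

# Harder case: passes disagree, so more rounds are needed.
votes = [True, False, True, True, False, True, True]
hard = deliberate(lambda i: votes[i % len(votes)], max_rounds=8, margin=3)

print(easy, hard)
```

With these inputs the unanimous case stops after three rounds, while the contested case needs seven before one side leads by the required margin.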

Implications for RL training

For teams training LLMs via RL (RLHF, RLVR, DPO-style), the message is that the verification component can be agentic, not just a static model. This opens the door to:

Better process reward modelling for mathematics, code, and reasoning tasks
Tool-augmented training signals — signals from code execution are deterministic, reducing noise in the RL loop
Reduced reward hacking — an agentic verifier with forward+backward agents and tools is harder to fool than a plain ORM that only reads text
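The point about deterministic signals from code execution can be illustrated with a minimal reward function that runs a candidate solution against test cases. This is a sketch under stated assumptions: `execution_reward` and its all-or-nothing scoring are hypothetical, and plain `exec` is used only for brevity. Untrusted model output should be run in a sandbox.

```python
def execution_reward(source, func_name, cases):
    """Return 1.0 if the function defined in `source` passes every case, else 0.0.

    `cases` is a list of (args_tuple, expected_result) pairs.
    """
    namespace = {}
    try:
        # WARNING: exec is unsafe for untrusted code; sandbox it in practice.
        exec(source, namespace)
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all score zero.
        return 0.0
    return 1.0 if passed else 0.0


cases = [((1, 2), 3), ((0, 0), 0)]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

print(execution_reward(good, "add", cases))
print(execution_reward(bad, "add", cases))
```

The same source and the same cases always yield the same reward, which is the property that makes execution-based signals less noisy than a learned score.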

The paper is directly relevant to the current wave of RLVR (RL with verifiable rewards) research because it shows that verifier quality dramatically changes training outcomes. Read alongside the criticism in the RLVR Gaming Verifiers study (19 April), AgentV-RL can be seen as one answer to the question of how to build a verifier that is harder to game.

Frequently Asked Questions

What does the forward agent do, and what does the backward agent do?

The forward agent follows reasoning from premises to conclusion — it checks whether each step is justified based on the previous ones, and verifies factual claims. The backward agent works in reverse — it checks whether the conclusion genuinely follows from the stated premises, or whether the premises were selected post-hoc to justify a predetermined answer. The two directions catch different classes of errors.

Why use tools in reward modelling?

A classical reward model only reads text and assigns a score. A tool-augmented verifier can execute code, look up facts in a knowledge base, or run a symbolic solver — concretely checking claims rather than assessing them purely probabilistically. For mathematical or programming problems the difference is significant, because a tool can provide a deterministic answer.

What does '4B model outperforms SOTA by 25.2%' mean?

The authors compared the 4B-parameter variant of AgentV-RL against the best outcome reward models, which typically evaluate only the final answer without deliberation. On the reward modelling benchmark, AgentV-RL scores 25.2% higher than those baselines, meaning its verdicts are more precise and track solution correctness more closely.
