ArXiv PRA: 4B model achieves 80.8% on MedQA, a new SOTA at the 4B scale
Why it matters
Process Reward Agents let small frozen models (0.5B to 8B parameters) reason markedly better on medical questions without any training: Qwen3-4B reaches 80.8% on MedQA, a new state of the art at the 4B parameter scale.
The verification problem in medicine
Reasoning in medical domains is particularly difficult because intermediate steps cannot be locally verified — unlike mathematics or code, checking the correctness of a step requires synthesizing information from large external knowledge bases. Subtle errors can propagate through the chain of thought and never be detected.
What are Process Reward Agents
The team (Sohn, Sternal, Styppa, Hoefler, Moor) introduces Process Reward Agents (PRA) — a method for providing domain-grounded, online, step-by-step rewards to a frozen model at test time.
Unlike previous Process Reward Models that evaluate completed trajectories post-hoc, PRA enables search-based decoding that ranks and prunes candidate trajectories at each generation step.
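To make the idea of ranking and pruning at each generation step concrete, here is a minimal sketch of step-level beam search guided by a step reward. It is not the authors' implementation: the `propose` and `reward` callables, the `ANSWER` stop prefix, and the toy data are all hypothetical stand-ins for a frozen model and a domain-grounded reward agent.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   propose(steps) -> candidate next reasoning steps from a frozen model
#   reward(steps)  -> score for the latest step of a partial trajectory
ProposeFn = Callable[[List[str]], List[str]]
RewardFn = Callable[[List[str]], float]


def step_level_beam_search(
    propose: ProposeFn,
    reward: RewardFn,
    beam_width: int = 2,
    max_steps: int = 4,
    stop_prefix: str = "ANSWER",
) -> List[str]:
    """Keep the `beam_width` best partial trajectories after every step."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_steps):
        candidates: List[Tuple[float, List[str]]] = []
        for score, steps in beams:
            if steps and steps[-1].startswith(stop_prefix):
                candidates.append((score, steps))  # finished: carry over unchanged
                continue
            for nxt in propose(steps):
                extended = steps + [nxt]
                # The reward is applied online, to the partial trajectory.
                candidates.append((score + reward(extended), extended))
        if not candidates:
            break  # nothing left to expand
        # Rank, then prune to the beam width.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
        if all(s and s[-1].startswith(stop_prefix) for _, s in beams):
            break
    return beams[0][1]


# Toy stand-ins so the sketch runs end to end.
def toy_propose(steps: List[str]) -> List[str]:
    if len(steps) == 0:
        return ["supported step", "unsupported step"]
    if len(steps) == 1:
        return ["ANSWER: A", "ANSWER: B"]
    return []


def toy_reward(steps: List[str]) -> float:
    scores = {"supported step": 1.0, "ANSWER: A": 1.0}
    return scores.get(steps[-1], 0.0)


best = step_level_beam_search(toy_propose, toy_reward)
print(best)
```

The point of the sketch is the placement of the reward call: it runs inside the decoding loop on partial trajectories, so weak branches are dropped before they are completed rather than scored after the fact.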
Results
- 80.8% accuracy on MedQA with Qwen3-4B — new state-of-the-art at the 4B parameter scale
- Generalizes to unseen frozen models from 0.5B to 8B parameters
- Accuracy improvement of up to 25.7% without any model updates
A new paradigm
PRA proposes a paradigm in which frozen reasoners are separated from domain-specific reward modules. This enables deployment of new backbone models in complex domains without the need for retraining — significant for medicine where model recertification is expensive and time-consuming.
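The separation described above can be pictured as two independent interfaces. The sketch below is an illustration under assumed names (`Reasoner`, `RewardModule`, `pick_next_step`, and the toy classes are hypothetical, not from the paper): the same reward module scores candidates from two different backbones, and neither backbone is modified.

```python
from typing import List, Protocol


class Reasoner(Protocol):
    """A frozen backbone that only proposes candidate next steps."""
    def next_steps(self, question: str, steps: List[str]) -> List[str]: ...


class RewardModule(Protocol):
    """A domain-specific scorer, independent of any particular backbone."""
    def score(self, question: str, steps: List[str]) -> float: ...


def pick_next_step(reasoner: Reasoner, reward: RewardModule,
                   question: str, steps: List[str]) -> str:
    """Choose the candidate step that the reward module rates highest."""
    candidates = reasoner.next_steps(question, steps)
    return max(candidates, key=lambda c: reward.score(question, steps + [c]))


# Toy implementations (assumptions for illustration only).
class BackboneA:
    def next_steps(self, question: str, steps: List[str]) -> List[str]:
        return ["grounded step", "ungrounded step"]


class BackboneB:
    def next_steps(self, question: str, steps: List[str]) -> List[str]:
        return ["ungrounded step", "another grounded step"]


class KeywordReward:
    def score(self, question: str, steps: List[str]) -> float:
        return 0.0 if "ungrounded" in steps[-1] else 1.0


reward = KeywordReward()
choice_a = pick_next_step(BackboneA(), reward, "q", [])
choice_b = pick_next_step(BackboneB(), reward, "q", [])
print(choice_a, "|", choice_b)
```

Swapping `BackboneA` for `BackboneB` requires no change to `KeywordReward`, which is the property that makes the reward module reusable across backbones.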