ArXiv PRA: 4B model achieves 80.8% on medical benchmark — new SOTA for small scale

The verification problem in medicine

Reasoning in medical domains is particularly difficult because intermediate steps cannot be locally verified — unlike mathematics or code, checking the correctness of a step requires synthesizing information from large external knowledge bases. Subtle errors can propagate through the chain of thought and never be detected.

What are Process Reward Agents?

The team (Sohn, Sternal, Styppa, Hoefler, Moor) introduces Process Reward Agents (PRA) — a method for providing domain-grounded, online, step-by-step rewards to a frozen model at test time.

Unlike previous Process Reward Models that evaluate completed trajectories post-hoc, PRA enables search-based decoding that ranks and prunes candidate trajectories at each generation step.
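
To make the idea of ranking and pruning at each generation step concrete, here is a minimal sketch of step-level beam search guided by a reward function. It is an illustration under stated assumptions, not the authors' implementation: the names `step_level_search`, `toy_propose`, `toy_score`, and `toy_is_done` are hypothetical, the "reasoner" is a fixed lookup table, and the "reward" is a set-membership check standing in for a domain-grounded verifier.

```python
from typing import Callable, List, Tuple

Trajectory = Tuple[str, ...]  # a partial chain of reasoning steps


def step_level_search(
    propose: Callable[[Trajectory], List[str]],  # frozen reasoner: next-step candidates
    score: Callable[[Trajectory], float],        # reward signal for a partial trajectory
    is_done: Callable[[Trajectory], bool],
    beam_width: int = 2,
    max_steps: int = 4,
) -> Trajectory:
    """Expand, rank, and prune partial trajectories one step at a time."""
    beam: List[Trajectory] = [()]
    for _ in range(max_steps):
        candidates: List[Trajectory] = []
        for traj in beam:
            if is_done(traj):
                candidates.append(traj)  # finished trajectories carry over unchanged
                continue
            for step in propose(traj):
                candidates.append(traj + (step,))
        if not candidates:
            break
        # Rank by reward, then prune to the beam width.
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
        if all(is_done(t) for t in beam):
            break
    return max(beam, key=score)


# --- Toy stand-ins (assumptions for illustration only) ---
SUPPORTED = {"grounded step 1", "grounded step 2", "answer: A"}
OPTIONS = [
    ["grounded step 1", "ungrounded step 1"],
    ["grounded step 2", "ungrounded step 2"],
    ["answer: A", "answer: B"],
]


def toy_propose(traj: Trajectory) -> List[str]:
    # A real system would sample continuations from a frozen language model.
    return OPTIONS[len(traj)] if len(traj) < len(OPTIONS) else []


def toy_score(traj: Trajectory) -> float:
    # A real reward agent would check each step against external knowledge.
    return sum(1.0 if step in SUPPORTED else -1.0 for step in traj)


def toy_is_done(traj: Trajectory) -> bool:
    return bool(traj) and traj[-1].startswith("answer:")


best = step_level_search(toy_propose, toy_score, toy_is_done)
print(best)
```

In this toy run, trajectories that begin with an unsupported step are dropped at the second expansion, before any of them reaches an answer. A post-hoc scorer would only see them after they had been generated in full.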

Results

80.8% accuracy on MedQA with Qwen3-4B — new state-of-the-art at the 4B parameter scale
Generalizes to unseen frozen models from 0.5B to 8B parameters
Accuracy improvement of up to 25.7% without any model updates

A new paradigm

PRA proposes a paradigm in which frozen reasoners are separated from domain-specific reward modules. This enables deployment of new backbone models in complex domains without the need for retraining — significant for medicine where model recertification is expensive and time-consuming.
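
The separation described above can be pictured as two small interfaces: one for the frozen reasoner and one for the domain reward module. The sketch below is hypothetical throughout (`Reasoner`, `RewardModule`, `KeywordReward`, `SmallBackbone`, `LargeBackbone`, and `pick_next_step` are illustrative names, and the keyword-count reward is a toy). It only shows that the same reward module can be reused while the backbone is swapped.

```python
from typing import Iterable, List, Protocol


class Reasoner(Protocol):
    """Any frozen backbone that can propose candidate next steps."""

    def propose_steps(self, context: str) -> List[str]: ...


class RewardModule(Protocol):
    """Any domain-specific scorer for a single candidate step."""

    def score_step(self, context: str, step: str) -> float: ...


class KeywordReward:
    """Toy reward: counts how many terms from a fixed set a step mentions."""

    def __init__(self, terms: Iterable[str]) -> None:
        self.terms = set(terms)

    def score_step(self, context: str, step: str) -> float:
        return float(sum(term in step for term in self.terms))


class SmallBackbone:
    def propose_steps(self, context: str) -> List[str]:
        return ["guess", "cite guideline"]


class LargeBackbone:
    def propose_steps(self, context: str) -> List[str]:
        return ["cite guideline and trial", "guess"]


def pick_next_step(reasoner: Reasoner, reward: RewardModule, context: str) -> str:
    # The reasoner is never updated; only its proposals are ranked.
    steps = reasoner.propose_steps(context)
    return max(steps, key=lambda s: reward.score_step(context, s))


reward = KeywordReward(["guideline", "trial"])
choices = [
    pick_next_step(backbone, reward, "question")
    for backbone in (SmallBackbone(), LargeBackbone())
]
print(choices)
```

Because `pick_next_step` depends only on the two interfaces, replacing the backbone requires no change to the reward module and no weight updates.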
