🤖 24 AI
🟡 🤖 Models Monday, April 13, 2026 · 1 min read

ArXiv PRA: 4B model achieves 80.8% on medical benchmark — new SOTA for small scale

Why it matters

Process Reward Agents (PRA) let small frozen models (0.5B to 8B parameters) reason better on medical questions without any updates to the model itself. With PRA, Qwen3-4B reaches 80.8% on MedQA, a new state of the art at the 4B parameter scale.

The verification problem in medicine

Reasoning in medical domains is particularly difficult because intermediate steps cannot be locally verified — unlike mathematics or code, checking the correctness of a step requires synthesizing information from large external knowledge bases. Subtle errors can propagate through the chain of thought and never be detected.

What are Process Reward Agents

The team (Sohn, Sternal, Styppa, Hoefler, Moor) introduces Process Reward Agents (PRA) — a method for providing domain-grounded, online, step-by-step rewards to a frozen model at test time.

Unlike previous Process Reward Models that evaluate completed trajectories post-hoc, PRA enables search-based decoding that ranks and prunes candidate trajectories at each generation step.
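The rank-and-prune idea can be illustrated with a minimal step-level beam search. This is a sketch under stated assumptions, not the authors' implementation: the function `step_beam_search`, its parameters, and the toy `propose`, `reward`, and `is_final` callables are all hypothetical stand-ins. A real system would call a frozen language model to propose steps and a domain-grounded reward agent to score them.

```python
from typing import Callable, List, Tuple

# A trajectory is the sequence of reasoning steps generated so far.
Trajectory = Tuple[str, ...]


def step_beam_search(
    propose: Callable[[Trajectory], List[str]],   # hypothetical: frozen model proposing next steps
    reward: Callable[[Trajectory], float],        # hypothetical: step-level reward for a partial trajectory
    is_final: Callable[[Trajectory], bool],       # hypothetical: has the trajectory reached an answer?
    beam_width: int = 2,
    max_steps: int = 8,
) -> Trajectory:
    """Extend, score, and prune partial trajectories one step at a time."""
    beam: List[Trajectory] = [()]
    for _ in range(max_steps):
        candidates: List[Trajectory] = []
        for traj in beam:
            if is_final(traj):
                candidates.append(traj)  # finished trajectories are carried forward
                continue
            for step in propose(traj):
                candidates.append(traj + (step,))
        if not candidates:
            break  # nothing left to extend; keep the previous beam
        # Rank by reward and keep only the best few (the pruning step).
        candidates.sort(key=reward, reverse=True)
        beam = candidates[:beam_width]
        if all(is_final(t) for t in beam):
            break
    return beam[0]


# Toy stand-ins so the sketch runs without a model (illustrative only).
def toy_propose(traj: Trajectory) -> List[str]:
    if len(traj) < 2:
        n = len(traj) + 1
        return [f"step{n}:supported", f"step{n}:unsupported"]
    return ["answer:final"]


def toy_reward(traj: Trajectory) -> float:
    # Pretend the reward agent checked each step against a knowledge base.
    return float(sum(step.endswith(":supported") for step in traj))


def toy_is_final(traj: Trajectory) -> bool:
    return bool(traj) and traj[-1].startswith("answer")


best = step_beam_search(toy_propose, toy_reward, toy_is_final)
print(best)
```

In the toy run, trajectories that contain an unsupported step score lower and are pruned before they reach an answer. A post-hoc scorer would only see them after they were fully generated.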

Results

  • 80.8% accuracy on MedQA with Qwen3-4B — new state-of-the-art at the 4B parameter scale
  • Generalizes to unseen frozen models from 0.5B to 8B parameters
  • Accuracy improvement of up to 25.7% without any model updates

A new paradigm

PRA proposes a paradigm in which frozen reasoners are separated from domain-specific reward modules. This enables deployment of new backbone models in complex domains without the need for retraining — significant for medicine where model recertification is expensive and time-consuming.
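One way to picture this separation is as two independent interfaces, so that the reward module is reused unchanged when the backbone is swapped. The sketch below is an assumption made for illustration: `Reasoner`, `RewardModule`, `KeywordReward`, `CannedReasoner`, and `pick_step` are hypothetical names, not part of the paper.

```python
from typing import List, Protocol


class Reasoner(Protocol):
    """Hypothetical interface for a frozen backbone model."""
    def next_steps(self, question: str, steps: List[str]) -> List[str]: ...


class RewardModule(Protocol):
    """Hypothetical interface for a domain-specific reward module."""
    def score(self, question: str, steps: List[str]) -> float: ...


class KeywordReward:
    """Toy reward: counts steps that mention a supported term."""
    def __init__(self, supported_terms: List[str]) -> None:
        self.supported_terms = supported_terms

    def score(self, question: str, steps: List[str]) -> float:
        return float(sum(any(t in s for t in self.supported_terms) for s in steps))


class CannedReasoner:
    """Toy backbone that always proposes the same candidate steps."""
    def __init__(self, options: List[str]) -> None:
        self.options = options

    def next_steps(self, question: str, steps: List[str]) -> List[str]:
        return list(self.options)


def pick_step(reasoner: Reasoner, reward: RewardModule,
              question: str, steps: List[str]) -> str:
    # The reward module never inspects the backbone, only its proposed steps.
    candidates = reasoner.next_steps(question, steps)
    return max(candidates, key=lambda s: reward.score(question, steps + [s]))


reward = KeywordReward(["guideline"])                       # built once
small = CannedReasoner(["guess", "cites guideline"])        # one backbone
large = CannedReasoner(["cites guideline and dose", "speculates"])  # a different backbone

small_choice = pick_step(small, reward, "toy question", [])
large_choice = pick_step(large, reward, "toy question", [])
print(small_choice, "|", large_choice)
```

Because `pick_step` depends only on the two interfaces, replacing `small` with `large` requires no change to the reward side, which mirrors the article's deployment argument.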

🤖 This article was generated using artificial intelligence from primary sources.