Monday, April 13, 2026

4 articles: 🔴 1 critical, 🟡 2 important, 🟢 1 interesting

🤖 Models (2)

🟡 🤖 Models April 13, 2026 · 1 min read

ArXiv PRA: 4B model achieves 80.8% on medical benchmark, a new SOTA at small scale

Process Reward Agents let small frozen models (0.5B to 8B parameters) significantly improve their medical reasoning without any training. Qwen3-4B reaches 80.8% on MedQA, a new state of the art.

🟡 🤖 Models April 13, 2026 · 1 min read

ArXiv SPPO: Sequence-level PPO solves the credit assignment problem in long reasoning chains

Sequence-Level PPO reformulates LLM reasoning as a contextual bandit problem, matching the performance of expensive group-based methods such as GRPO with far fewer resources and without multi-sampling.

🤝 Agents (2)

🔴 🤝 Agents April 13, 2026 · 2 min read

ArXiv HiL-Bench: no frontier model knows when to ask for help

A new benchmark reveals a universal judgment deficit in AI agents: when specifications are incomplete, no frontier model achieves more than a fraction of its full performance. The researchers show that knowing when to ask for help is a skill that can be trained with RL.

🟢 🤝 Agents April 13, 2026 · 2 min read

ArXiv SAGE: 27 LLMs tested — models understand intent but don't execute correctly

A new customer-service benchmark reveals two phenomena: an 'Execution Gap' (models correctly classify intents but fail to perform the correct actions) and 'Empathy Resilience' (models remain polite while making logical errors).
