ArXiv HiL-Bench: no frontier model knows when to ask for help
A new benchmark reveals a universal judgment deficiency in AI agents: when specifications are incomplete and the agent must decide for itself whether to ask, no frontier model recovers more than a fraction of its full performance. The researchers also show that this skill can be trained with reinforcement learning (RL).
This article was generated using artificial intelligence from primary sources.
Universal judgment problem
A team of researchers (Elfeki, Trinh, Luu et al.) presented HiL-Bench (Human-in-the-Loop Benchmark) — the first benchmark that specifically measures whether AI agents can recognize when they need to ask a human for help instead of guessing.
Existing benchmarks give agents complete, unambiguous instructions and only measure execution accuracy. HiL-Bench does the opposite: each task contains validated blockers — missing information, ambiguous requirements, or contradictory specifications — that are only revealed through progressive exploration, not upfront.
No frontier model passes
Evaluation across software engineering (SWE) and text-to-SQL domains reveals a large, universal judgment gap: when a model must decide on its own whether to ask for clarification, no frontier model recovers more than a fraction of its full performance.
The new Ask-F1 metric is the harmonic mean of question precision and blocker recall. It is built so that question spamming cannot game it: asking more questions may raise recall, but every unnecessary question lowers precision.
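To make the anti-spamming property concrete, here is a minimal sketch of a harmonic-mean score of this kind. The article gives only the high-level definition, so the function name `ask_f1`, its parameters, and the exact way precision and recall are counted (relevant questions over questions asked, blockers surfaced over blockers present) are assumptions for illustration, not the benchmark's actual implementation.

```python
def ask_f1(questions_asked: int, relevant_questions: int,
           blockers_total: int, blockers_found: int) -> float:
    """Harmonic mean of question precision and blocker recall.

    Assumed definitions (hypothetical, not taken from the paper):
      precision = relevant_questions / questions_asked
      recall    = blockers_found / blockers_total
    """
    if questions_asked == 0 or blockers_total == 0:
        return 0.0
    precision = relevant_questions / questions_asked
    recall = blockers_found / blockers_total
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Targeted agent: 3 questions, all relevant, surfacing 3 of 4 blockers.
targeted = ask_f1(questions_asked=3, relevant_questions=3,
                  blockers_total=4, blockers_found=3)

# Spamming agent: 40 questions, only 4 relevant, surfacing all 4 blockers.
spamming = ask_f1(questions_asked=40, relevant_questions=4,
                  blockers_total=4, blockers_found=4)

print(f"targeted={targeted:.3f} spamming={spamming:.3f}")
```

Under these assumed counts the spamming agent reaches perfect recall yet scores far lower than the targeted one, because the harmonic mean is pulled toward the weaker of the two components.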
Three failure patterns
The analysis identifies three systematic patterns:
- Overconfident false beliefs — the agent does not detect the information gap
- High uncertainty detection but persistent errors — the agent recognizes the problem but does not escalate
- Broad, imprecise escalation — the agent asks too generally, without self-correction
Judgment can be trained
Key finding: RL training with Ask-F1 as the reward signal improves judgment. After training, a 32B model improves in both question quality and task pass rate, and the gains transfer across domains. The model does not learn domain-specific heuristics; it learns to detect irresolvable uncertainty and act on it.
For anyone using AI agents in production, this is a warning: agents that appear competent on complete specifications can fail catastrophically when information is missing, and missing information is the norm in the real world.