ArXiv HiL-Bench: no frontier model knows when to ask for help
Why it matters
A new benchmark reveals a universal judgment deficiency in AI agents — when specifications are incomplete, no frontier model achieves more than a fraction of its full performance. Researchers show this skill can be trained with RL.
Universal judgment problem
A team of researchers (Elfeki, Trinh, Luu et al.) presented HiL-Bench (Human-in-the-Loop Benchmark) — the first benchmark that specifically measures whether AI agents can recognize when they need to ask a human for help instead of guessing.
Existing benchmarks give agents complete, unambiguous instructions and only measure execution accuracy. HiL-Bench does the opposite: each task contains validated blockers — missing information, ambiguous requirements, or contradictory specifications — that are only revealed through progressive exploration, not upfront.
No frontier model passes
Evaluation across software engineering (SWE) and text-to-SQL domains reveals a large, universal judgment gap: when a model must decide on its own whether to ask for clarification, no frontier model recovers more than a fraction of the performance it achieves with complete information.
The new Ask-F1 metric (harmonic mean of question precision and blocker recall) architecturally prevents gaming through question spamming.
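To see why a harmonic mean resists question spamming, here is a minimal sketch of an Ask-F1-style score. The function name, its parameters, and the way questions and blockers are counted are assumptions made for illustration; the benchmark's exact definitions may differ.

```python
def ask_f1(questions_asked: int, relevant_questions: int,
           blockers_total: int, blockers_surfaced: int) -> float:
    """Harmonic mean of question precision and blocker recall.

    Hypothetical counting scheme (an assumption, not the benchmark's code):
      precision = relevant_questions / questions_asked
      recall    = blockers_surfaced / blockers_total
    """
    if questions_asked == 0 or blockers_total == 0:
        return 0.0  # an agent that never asks cannot score
    precision = relevant_questions / questions_asked
    recall = blockers_surfaced / blockers_total
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Three illustrative agents on a task with 3 blockers (made-up numbers).
targeted = ask_f1(questions_asked=3, relevant_questions=3,
                  blockers_total=3, blockers_surfaced=3)
spammer = ask_f1(questions_asked=30, relevant_questions=3,
                 blockers_total=3, blockers_surfaced=3)
silent = ask_f1(questions_asked=0, relevant_questions=0,
                blockers_total=3, blockers_surfaced=0)

print(f"targeted={targeted:.2f} spammer={spammer:.2f} silent={silent:.2f}")
```

Under these assumptions the spamming agent reaches full recall but only 10% precision, so its score collapses to roughly 0.18, while the agent that never asks scores zero. Neither extreme is rewarded; only targeted questions are.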
Three failure patterns
The analysis identifies three systematic patterns:
- Overconfident false beliefs — the agent does not detect the information gap
- High uncertainty detection but persistent errors — the agent recognizes the problem but does not escalate
- Broad, imprecise escalation — the agent asks too generally, without self-correction
Judgment can be trained
Key finding: RL training on an Ask-F1 reward signal improves judgment. After training, a 32B model improves both question quality and task pass rate, and the gains transfer across domains. Rather than learning domain-specific heuristics, the model learns to detect irresolvable uncertainty and act on it.
For anyone using AI agents in production, this is a warning: agents that appear competent on complete specifications can fail catastrophically when information is missing — and that is the norm in the real world.
This article was generated using artificial intelligence from primary sources.