ArXiv HiL-Bench: no frontier model knows when to ask for help
Why it matters
A new benchmark reveals a universal judgment deficiency in AI agents: when specifications are incomplete, no frontier model recovers more than a fraction of the performance it achieves with complete information. The researchers also show that this skill can be trained with RL.
Universal judgment problem
A team of researchers (Elfeki, Trinh, Luu et al.) presented HiL-Bench (Human-in-the-Loop Benchmark), the first benchmark that specifically measures whether AI agents can recognize when they need to ask a human for help instead of guessing.
Existing benchmarks give agents complete, unambiguous instructions and only measure execution accuracy. HiL-Bench does the opposite: each task contains validated blockers (missing information, ambiguous requirements, or contradictory specifications) that are revealed only through progressive exploration, not upfront.
No frontier model passes
Evaluation across the software engineering (SWE) and text-to-SQL domains reveals a large, universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when it must decide on its own whether to ask for clarification.
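The new Ask-F1 metric (the harmonic mean of question precision and blocker recall) is built to resist gaming through question spamming: every irrelevant question lowers precision, so asking about everything drags the score down even when all blockers are covered.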
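The harmonic-mean definition of Ask-F1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `ask_f1`, its parameters, and the counting rules (precision as the share of asked questions that target a real blocker, recall as the share of blockers covered by at least one question) are assumptions made for this example.

```python
def ask_f1(questions_asked: int, relevant_questions: int,
           blockers_total: int, blockers_covered: int) -> float:
    """Harmonic mean of question precision and blocker recall.

    Assumed definitions (hypothetical, for illustration only):
      precision = relevant_questions / questions_asked
      recall    = blockers_covered / blockers_total
    """
    precision = relevant_questions / questions_asked if questions_asked else 0.0
    recall = blockers_covered / blockers_total if blockers_total else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Three hypothetical agents on a task with 3 blockers.
targeted = ask_f1(questions_asked=3, relevant_questions=3,
                  blockers_total=3, blockers_covered=3)
spammer = ask_f1(questions_asked=30, relevant_questions=3,
                 blockers_total=3, blockers_covered=3)
silent = ask_f1(questions_asked=0, relevant_questions=0,
                blockers_total=3, blockers_covered=0)

print(f"targeted={targeted:.3f} spammer={spammer:.3f} silent={silent:.3f}")
# prints: targeted=1.000 spammer=0.182 silent=0.000
```

Under these assumptions, the spamming agent covers every blocker yet scores about 0.18, because 27 of its 30 questions are irrelevant. The agent that never asks scores 0, so neither guessing silently nor asking indiscriminately is rewarded.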
Three failure patterns
The analysis identifies three systematic patterns:
- Overconfident false beliefs: the agent does not detect the information gap
- High uncertainty detection but persistent errors: the agent recognizes the problem but does not escalate
- Broad, imprecise escalation: the agent asks too generally, without self-correction
Judgment can be trained
Key finding: RL training with Ask-F1 as the reward signal improves judgment. After training, a 32B model improves in both question quality and task pass rate, and the gains transfer across domains. The model does not learn domain-specific heuristics; it learns to detect irresolvable uncertainty and act on it.
For anyone using AI agents in production, this is a warning: agents that appear competent on complete specifications can fail catastrophically when information is missing, and missing information is the norm in the real world.