🤖 24 AI
🔴 🤝 Agents · Monday, April 13, 2026 · 2 min read

arXiv: HiL-Bench finds that no frontier model knows when to ask for help

Why it matters

A new benchmark reveals a universal judgment deficit in AI agents: when specifications are incomplete, no frontier model achieves more than a fraction of the performance it reaches with complete information. The researchers also show that this skill can be trained with reinforcement learning (RL).

Universal judgment problem

A team of researchers (Elfeki, Trinh, Luu et al.) presented HiL-Bench (Human-in-the-Loop Benchmark), the first benchmark that specifically measures whether AI agents can recognize when they need to ask a human for help instead of guessing.

Existing benchmarks give agents complete, unambiguous instructions and measure only execution accuracy. HiL-Bench does the opposite: each task contains validated blockers (missing information, ambiguous requirements, or contradictory specifications) that are revealed only through progressive exploration, not upfront.

No frontier model passes

Evaluation across software engineering (SWE) and text-to-SQL domains reveals a large, universal judgment gap: no frontier model achieves more than a fraction of its full-information performance when it must decide on its own whether to ask for clarification.

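The new Ask-F1 metric (the harmonic mean of question precision and blocker recall) is built so that it cannot be gamed by question spamming: asking many questions may raise blocker recall, but it lowers question precision, and the harmonic mean is pulled toward the lower of the two.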
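The effect of the harmonic mean can be illustrated with a minimal sketch. The article does not give the paper's exact definitions, so the ones used here are assumptions: question precision is taken as the share of asked questions that target a real blocker, and blocker recall as the share of blockers that at least one question uncovers. The function name `ask_f1` and its parameters are hypothetical.

```python
def ask_f1(questions_asked: int, relevant_questions: int,
           blockers_total: int, blockers_found: int) -> float:
    """Harmonic mean of question precision and blocker recall.

    Assumed definitions (not taken from the paper):
      precision = relevant_questions / questions_asked
      recall    = blockers_found / blockers_total
    """
    # An agent that never asks, or a task without blockers, scores 0 here.
    if questions_asked == 0 or blockers_total == 0:
        return 0.0
    precision = relevant_questions / questions_asked
    recall = blockers_found / blockers_total
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Targeted agent: 3 questions, all relevant, 2 of 3 blockers found.
targeted = ask_f1(questions_asked=3, relevant_questions=3,
                  blockers_total=3, blockers_found=2)

# Spamming agent: 30 questions, only 3 relevant, all 3 blockers found.
spammer = ask_f1(questions_asked=30, relevant_questions=3,
                 blockers_total=3, blockers_found=3)

# Silent agent: never asks, so it finds no blockers.
silent = ask_f1(questions_asked=0, relevant_questions=0,
                blockers_total=3, blockers_found=0)

print(f"targeted={targeted:.2f} spammer={spammer:.2f} silent={silent:.2f}")
```

Under these assumptions the spamming agent reaches perfect recall yet scores far below the targeted agent, because its precision of 0.1 drags the harmonic mean down.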

Three failure patterns

The analysis identifies three systematic patterns:

  1. Overconfident false beliefs: the agent does not detect the information gap
  2. High uncertainty detection but persistent errors: the agent recognizes the problem but does not escalate
  3. Broad, imprecise escalation: the agent asks too generally, without self-correction

Judgment can be trained

Key finding: RL training with an Ask-F1 reward signal improves judgment. After training, a 32B model improves in both question quality and task pass rate, and the gains transfer across domains. Rather than learning domain-specific heuristics, the model learns to detect irresolvable uncertainty and act on it.

For anyone using AI agents in production, this is a warning: agents that appear competent on complete specifications can fail catastrophically when information is missing, and missing information is the norm in the real world.

🤖 This article was generated using artificial intelligence from primary sources.