ArXiv HiL-Bench: no frontier model knows when to ask for help

A new benchmark reveals a judgment deficiency shared by every AI agent tested: when specifications are incomplete, no frontier model recovers more than a fraction of the performance it reaches with complete information. The researchers also show that this skill can be trained with RL.

Universal judgment problem

A team of researchers (Elfeki, Trinh, Luu et al.) presented HiL-Bench (Human-in-the-Loop Benchmark), the first benchmark that specifically measures whether AI agents can recognize when they need to ask a human for help instead of guessing.

Existing benchmarks give agents complete, unambiguous instructions and measure only execution accuracy. HiL-Bench does the opposite: each task contains validated blockers (missing information, ambiguous requirements, or contradictory specifications) that surface only through progressive exploration rather than being visible upfront.

No frontier model passes

Evaluation across the SWE and text-to-SQL domains reveals a large judgment gap that holds for every model tested: no frontier model recovers more than a fraction of its full-information performance when it must decide on its own whether to ask for clarification.

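The new Ask-F1 metric (the harmonic mean of question precision and blocker recall) resists gaming through question spamming by construction: asking many irrelevant questions may raise recall, but it lowers precision, and the harmonic mean is dragged down by whichever of the two is lower.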
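A minimal sketch of how a harmonic-mean score of this kind behaves may help. The article does not give the paper's exact definitions, so the following are assumptions: question precision is taken as the share of asked questions that address a real blocker, and blocker recall as the share of blockers covered by at least one question. The names `AskStats` and `ask_f1` are hypothetical and not taken from the benchmark's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AskStats:
    """Per-task counts (hypothetical structure, not from the benchmark)."""
    questions_asked: int     # total questions the agent sent to the human
    relevant_questions: int  # questions that addressed a real blocker
    blockers_total: int      # validated blockers present in the task
    blockers_covered: int    # distinct blockers hit by at least one question


def ask_f1(stats: AskStats) -> float:
    """Harmonic mean of question precision and blocker recall (assumed definitions)."""
    precision = (
        stats.relevant_questions / stats.questions_asked
        if stats.questions_asked else 0.0
    )
    recall = (
        stats.blockers_covered / stats.blockers_total
        if stats.blockers_total else 0.0
    )
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Three illustrative agents on a task with three blockers (made-up numbers).
targeted = AskStats(questions_asked=3, relevant_questions=3,
                    blockers_total=3, blockers_covered=3)
spammer = AskStats(questions_asked=30, relevant_questions=3,
                   blockers_total=3, blockers_covered=3)
silent = AskStats(questions_asked=0, relevant_questions=0,
                  blockers_total=3, blockers_covered=0)

for name, stats in [("targeted", targeted), ("spammer", spammer), ("silent", silent)]:
    print(f"{name}: {ask_f1(stats):.3f}")
```

Under these assumptions the spamming agent covers every blocker yet scores far below the targeted one, because its precision is only 3/30, while the agent that never asks scores zero.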

Three failure patterns

The analysis identifies three systematic patterns:

1. Overconfident false beliefs: the agent does not detect the information gap.
2. High uncertainty detection but persistent errors: the agent recognizes the problem but does not escalate.
3. Broad, imprecise escalation: the agent asks questions that are too general and does not self-correct.

Judgment can be trained

Key finding: RL training with Ask-F1 as the reward signal improves judgment. After training, a 32B model improves both its question quality and its task pass rate, and the gains transfer across domains. The model does not learn domain-specific heuristics; it learns to detect irresolvable uncertainty and act on it.

For anyone using AI agents in production, this is a warning: agents that appear competent on complete specifications can fail catastrophically when information is missing, and missing information is the norm in the real world.
