ArXiv HiL-Bench: Do AI agents know when to ask a human for help?

A research team has introduced HiL-Bench (Human-in-the-Loop Benchmark), the first systematic benchmark to test an important but often overlooked capability of AI agents: whether they can recognize when they lack information and should ask a human for help.

The problem of confident guessing

Today's AI agents are designed to be helpful and effective. But this bias toward action has a downside: agents often continue executing a task even when they lack sufficient information, preferring to guess rather than admit uncertainty. In high-stakes applications such as medicine, finance, or legal systems, a wrong guess can have serious consequences.

What HiL-Bench reveals

The benchmark places agents in situations where some tasks cannot be resolved correctly without additional information from the user. The key question is: will the agent recognize this need and ask for help, or will it proceed on its own?
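One way to picture how such an evaluation could be scored is to treat "ask for help" as a binary decision and compare it with whether the task actually needed user input. The sketch below is purely illustrative and is not HiL-Bench's actual metric or data format: the `Episode` record, the `ask_scores` function, and the sample episodes are all hypothetical names and values invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical record of one benchmark task (not the HiL-Bench schema).
    needs_help: bool  # ground truth: task cannot be solved without user input
    asked: bool       # observed: agent asked the user a clarifying question


def ask_scores(episodes):
    """Return (precision, recall) of the agent's decision to ask for help."""
    true_asks = sum(e.needs_help and e.asked for e in episodes)
    all_asks = sum(e.asked for e in episodes)
    all_needed = sum(e.needs_help for e in episodes)
    # Guard against division by zero when the agent never asks
    # or when no task required help.
    precision = true_asks / all_asks if all_asks else 0.0
    recall = true_asks / all_needed if all_needed else 0.0
    return precision, recall


# Made-up episodes: the agent asks only once although three tasks needed help.
episodes = [
    Episode(needs_help=True, asked=True),
    Episode(needs_help=True, asked=False),
    Episode(needs_help=True, asked=False),
    Episode(needs_help=False, asked=False),
]
precision, recall = ask_scores(episodes)
```

Under this framing, an agent that guesses confidently would show low recall (it misses most cases where it should have asked), while an agent that asks about everything would show low precision.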

The results are sobering: even frontier models are poor at recognizing the limits of their own knowledge. Agents consistently overestimate their capabilities and rarely ask for clarification. However, the researchers found that targeted training significantly improves this skill, suggesting the problem is solvable.

Implications for the industry

As AI agents are increasingly used in autonomous scenarios, the ability to recognize one's own limitations becomes a critical safety feature. HiL-Bench provides a standardized way to measure this ability, and such a measurement should become part of every evaluation of agentic systems.
