ArXiv: LLM agents misjudge when they need tools

Researchers from the Max Planck Institute for Software Systems and collaborators have published a framework for evaluating the tool-calling decisions of LLM agents along three dimensions: necessity, benefit, and cost acceptability. Experiments on six models and three tasks reveal a significant gap between what a model thinks it needs and what actually increases accuracy, a gap that directly affects the cost and reliability of production agents.

On May 1, 2026, Qinyuan Wu and collaborators from the Max Planck Institute for Software Systems, Imperial College, and Helmholtz Munich published a framework that addresses one of the most costly problems in production AI agents: when a model should call an external tool, and when it should not. The paper’s title, “To Call or Not to Call,” captures a dilemma that in practice costs seconds of latency, dollars in API fees, and percentage points of accuracy.

The foundational premise is that tool calls are not always beneficial; some are redundant, and some are even harmful. Web search can introduce noisy information that confuses the model; a calculator may be called for arithmetic the model can already do; a database may return irrelevant rows that overload the context.

What are the three dimensions of tool-call evaluation?

The framework measures each potential tool call across three orthogonal dimensions: necessity (is the tool required for the task at all?), benefit (does it improve the outcome when used?), and cost acceptability (is the added latency and cost justified?). All three must be positive for a call to be rational.
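As a rough illustration of the "all three must be positive" rule, the three judgments can be held in a small record with a single combined check. This is a minimal sketch, not the paper's code: the class name `ToolCallAssessment` and its fields are hypothetical names chosen here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCallAssessment:
    """Hypothetical record of the three judgments for one candidate tool call."""

    necessary: bool        # is the tool required for the task at all?
    beneficial: bool       # does using it improve the outcome?
    cost_acceptable: bool  # is the added latency and cost justified?

    def is_rational(self) -> bool:
        # A call counts as rational only when all three dimensions are positive.
        return self.necessary and self.beneficial and self.cost_acceptable


# A helpful but too-expensive call is rejected; an all-positive call is accepted.
too_costly = ToolCallAssessment(necessary=True, beneficial=True, cost_acceptable=False)
worthwhile = ToolCallAssessment(necessary=True, beneficial=True, cost_acceptable=True)
print(too_costly.is_rational(), worthwhile.is_rational())
```

Keeping the three fields separate, rather than collapsing them into one score, makes it visible which dimension blocked a given call.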

The distinction is subtle but crucial: a tool may be necessary in principle (the type of task calls for it) yet non-beneficial in a given case (the model already answers correctly without it), or it may be beneficial (it improves accuracy) yet ruled out by its cost in a real-time scenario.

How do the authors compare model self-assessment to reality?

The approach combines two perspectives. The normative assessment comes from ground truth: for a given task, which tool calls would be optimal? The descriptive assessment comes from model behavior: which calls does the model think it needs?

The gap between them reveals a systematic bias. Models frequently call tools that do not help (web search is the main culprit), and sometimes skip tools that would help. In other words, self-assessment is not a reliable signal.
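One simple way to quantify such a gap, assuming per-example labels are available, is to count how often the model calls a tool that does not help and how often it skips one that would. The sketch below is an illustration only: the function name `call_gap` and the toy data are made up here and do not come from the paper.

```python
def call_gap(model_calls: list[bool], call_helps: list[bool]) -> dict[str, float]:
    """Compare what the model chose (descriptive) with what would help (normative)."""
    if len(model_calls) != len(call_helps) or not model_calls:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(model_calls)
    pairs = list(zip(model_calls, call_helps))
    over = sum(called and not helps for called, helps in pairs)   # called, no benefit
    under = sum(helps and not called for called, helps in pairs)  # skipped, would help
    return {"over_call_rate": over / n, "under_call_rate": under / n}


# Toy data, invented for illustration (not results from the paper).
model_calls = [True, True, False, True, False]
call_helps = [True, False, True, False, False]
print(call_gap(model_calls, call_helps))
```

On this toy data the model over-calls on two of five examples and under-calls on one, so both kinds of error show up separately rather than cancelling out in a single accuracy number.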

Lightweight estimators from hidden states

The main technical contribution is a set of lightweight estimators that predict necessity and benefit from the model’s own hidden states, with no additional API calls. These estimators form the basis of a controller that decides whether a tool call is needed, independently of what the model “thinks.”

Experiments across three tasks and six models show that controllers consistently outperform model self-assessment on the combined accuracy-and-cost metric.
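The article does not specify the estimators' architecture, so the following is only a sketch of the general idea under a stated assumption: that a linear (logistic-regression) probe is trained on hidden-state vectors with labels saying whether a tool call helped. The hidden states here are synthetic three-dimensional stand-ins, and `train_probe` and `predict` are hypothetical names; real vectors would come from the LLM itself.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w: list[float], b: float, x: list[float]) -> float:
    """Estimated probability that a tool call is beneficial, given hidden state x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(states, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w, b = [0.0] * len(states[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(states, labels):
            g = predict(w, b, x) - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


# Synthetic stand-in for hidden states: dimension 0 carries the signal,
# the other two are noise. Labels alternate between "helps" (1) and "does not" (0).
rng = random.Random(0)
labels = [i % 2 for i in range(40)]
states = [
    [(1.0 if y else -1.0) + rng.uniform(-0.3, 0.3), rng.uniform(-1, 1), rng.uniform(-1, 1)]
    for y in labels
]

w, b = train_probe(states, labels)
call_tool = predict(w, b, [1.0, 0.0, 0.0]) >= 0.5  # controller decision: call
skip_tool = predict(w, b, [-1.0, 0.0, 0.0]) < 0.5  # controller decision: skip
print(call_tool, skip_tool)
```

The point of the sketch is that the decision is read off the hidden state by a small external model, so it costs no extra API call and does not depend on the LLM's own stated preference.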

What does this mean for AI engineers?

For teams building agents with LangGraph, AutoGen, or Anthropic’s tool use, the paper validates a common intuition: don’t let the model alone decide whether it needs a tool; add a gating layer. A practice that was previously a heuristic now has a formal framework and empirical results behind it.
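A gating layer of this kind can be sketched independently of any framework. The example below is a minimal illustration under assumptions: `gated_tool_call`, `fake_search`, and `fake_estimator` are hypothetical names, the estimator is a stub standing in for a trained one, and nothing here uses the actual LangGraph, AutoGen, or Anthropic APIs.

```python
from typing import Callable, Optional


def gated_tool_call(
    query: str,
    tool: Callable[[str], str],
    estimate_benefit: Callable[[str], float],
    threshold: float = 0.5,
) -> Optional[str]:
    """Run `tool` only when an external estimator clears the threshold."""
    if estimate_benefit(query) >= threshold:
        return tool(query)
    return None  # the caller falls back to answering without the tool


def fake_search(query: str) -> str:
    # Stub tool for illustration; a real agent would hit a search API here.
    return f"results for: {query}"


def fake_estimator(query: str) -> float:
    # Stub estimator for illustration: treats time-sensitive queries as benefiting.
    return 0.9 if "today" in query.lower() else 0.1


print(gated_tool_call("capital of France?", fake_search, fake_estimator))
print(gated_tool_call("today's weather in Oslo?", fake_search, fake_estimator))
```

Because the gate sits outside the model, the threshold can be tuned against latency and cost budgets without retraining or re-prompting the agent.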

The implications extend to agent economics: if a production system can avoid 20–30% of unhelpful tool calls, then at a scale of one million requests per day the API savings alone can amount to thousands of dollars per month.
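The order of magnitude can be checked with back-of-the-envelope arithmetic. The numbers below are assumptions, not figures from the paper: one tool call per request, a hypothetical price of $0.001 per call, a 30-day month, and the 20–30% read as a share of all tool calls.

```python
def monthly_savings(
    requests_per_day: int,
    calls_per_request: float,
    avoided_fraction: float,
    cost_per_call_usd: float,
    days: int = 30,
) -> float:
    """Estimated monthly API savings from tool calls that are no longer made."""
    avoided_calls = requests_per_day * calls_per_request * avoided_fraction * days
    return avoided_calls * cost_per_call_usd


# Assumed inputs: 1M requests/day, 1 call per request, $0.001 per call (hypothetical).
for fraction in (0.20, 0.30):
    saved = monthly_savings(1_000_000, 1.0, fraction, 0.001)
    print(f"{fraction:.0%} avoided -> ${saved:,.0f}/month")
```

Under these assumptions the savings come to roughly $6,000 to $9,000 per month; a higher per-call price or several calls per request would scale the figure proportionally.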

Frequently Asked Questions

What does the 'To Call or Not to Call' framework investigate?

The framework investigates when LLM agents should and should not call external tools (web search, calculator, database). It distinguishes the model's self-assessment ('I think I need it') from the actual benefit of a tool call on answer accuracy.

What are the three dimensions of tool-call evaluation?

Necessity (is the tool needed at all?), benefit (does it improve the outcome?), and cost acceptability (does the latency/cost justify the gain?). All three must be positive for a tool call to be rational.

What did the authors find about model self-assessment?

There is a significant mismatch between what the model considers useful and what actually improves accuracy. Models sometimes call tools that do not help (web search introducing noisy information is the main culprit), and sometimes skip tools that would help.
