🤖 24 AI
Agents · Monday, April 13, 2026

ArXiv SAGE: 27 LLMs tested — models understand intent but don't execute correctly

Why it matters

A new benchmark for customer service agents reveals two phenomena: the 'Execution Gap' (models correctly classify intents but do not perform the correct actions) and 'Empathy Resilience' (models remain polite while making logical errors).

Benchmark for customer service agents

The team (Shi, Dai, Wang et al.) presents SAGE (Service Agent Graph-guided Evaluation) — a benchmark that formalizes unstructured SOPs (Standard Operating Procedures) into Dynamic Dialogue Graphs and tests how well LLMs follow them in practice.
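To make the idea of turning an SOP into a graph concrete, here is a minimal sketch. It is not SAGE's actual data format: the `SopGraph` class, its methods, and the refund procedure are all hypothetical, invented for illustration. It assumes each state maps a user intent to one required action and a next state.

```python
from dataclasses import dataclass, field


@dataclass
class SopGraph:
    """Hypothetical SOP graph: state -> {user intent -> (required action, next state)}."""
    start: str
    edges: dict = field(default_factory=dict)

    def add_step(self, state, intent, action, next_state):
        # Register the action the SOP requires when `intent` arrives in `state`.
        self.edges.setdefault(state, {})[intent] = (action, next_state)

    def expected_actions(self, intents):
        """Walk the graph along a sequence of user intents and list the required actions."""
        state, actions = self.start, []
        for intent in intents:
            action, state = self.edges[state][intent]
            actions.append(action)
        return actions


# Illustrative refund SOP (invented for this example, not taken from the benchmark).
refund_sop = SopGraph(start="greeting")
refund_sop.add_step("greeting", "request_refund", "ask_order_id", "await_order_id")
refund_sop.add_step("await_order_id", "provide_order_id", "verify_order", "verified")
refund_sop.add_step("verified", "confirm_refund", "issue_refund", "done")

expected = refund_sop.expected_actions(
    ["request_refund", "provide_order_id", "confirm_refund"]
)
```

An evaluator built this way could compare an agent's actual actions against `expected` turn by turn. A real dialogue graph would likely also need branching conditions and error handling, which this sketch leaves out.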

SAGE was used to evaluate 27 LLMs across 6 industrial scenarios, the largest evaluation of its kind for service agents.

Two key phenomena

Execution Gap

Models correctly classify the user’s intent (they know what the user wants), but don’t perform the correct next actions according to the SOP. Understanding does not equal execution.
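One simple way to picture this gap is as the difference between intent accuracy and action accuracy over the same turns. The sketch below assumes that definition; it is not necessarily how the SAGE authors compute it. The `execution_gap` function, the record fields, and the sample turns are hypothetical.

```python
def execution_gap(turns):
    """Return (intent_accuracy, action_accuracy, gap) over a list of turn records.

    Each record is a dict with predicted and gold intent/action labels.
    The gap is intent accuracy minus action accuracy (an assumed definition).
    """
    n = len(turns)
    intent_acc = sum(t["pred_intent"] == t["gold_intent"] for t in turns) / n
    action_acc = sum(t["pred_action"] == t["gold_action"] for t in turns) / n
    return intent_acc, action_acc, intent_acc - action_acc


# Invented sample: every intent is recognized, but half of the actions are wrong.
turns = [
    {"pred_intent": "refund", "gold_intent": "refund",
     "pred_action": "ask_order_id", "gold_action": "ask_order_id"},
    {"pred_intent": "refund", "gold_intent": "refund",
     "pred_action": "issue_refund", "gold_action": "verify_order"},
    {"pred_intent": "cancel", "gold_intent": "cancel",
     "pred_action": "confirm_cancel", "gold_action": "confirm_cancel"},
    {"pred_intent": "cancel", "gold_intent": "cancel",
     "pred_action": "apologize", "gold_action": "check_contract"},
]

intent_acc, action_acc, gap = execution_gap(turns)
```

In this invented sample the model scores perfectly on intent but follows the procedure on only half of the turns, so a benchmark that reports intent accuracy alone would hide the problem.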

Empathy Resilience

Under high adversarial pressure, models maintain a polite conversational facade while making logical errors beneath the surface. The user gets the impression that the agent is competent, when in reality it is doing the wrong things — a deceptive mode of failure.

Why it matters

For companies using AI agents in customer service, this is a warning: standard benchmarks that only measure “does the agent understand the question” miss a critical dimension — “does the agent do the right thing after understanding.” SAGE introduces an adversarial taxonomy of intents and a modular extension mechanism for testing in new domains at low cost.

🤖 This article was generated using artificial intelligence from primary sources.