ArXiv SAGE: 27 LLMs tested — models understand intent but don't execute correctly
Why it matters
A new benchmark for customer service agents reveals two phenomena: the 'Execution Gap' (models classify the user's intent correctly but fail to take the correct actions) and 'Empathy Resilience' (models remain polite while making logical errors).
Benchmark for customer service agents
The team (Shi, Dai, Wang et al.) presents SAGE (Service Agent Graph-guided Evaluation) — a benchmark that formalizes unstructured SOPs (Standard Operating Procedures) into Dynamic Dialogue Graphs and tests how well LLMs follow them in practice.
SAGE tested 27 LLMs across 6 industrial scenarios — the largest evaluation of its kind for service agents.
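To make the idea of turning an SOP into a graph concrete, here is a minimal sketch in Python. It is not SAGE's actual data format: the refund flow, the state names, and the `follows_sop` helper are all hypothetical, chosen only to show how a written procedure can become a structure that an agent's behavior is checked against.

```python
# Hypothetical toy SOP for a refund flow (not taken from the paper).
# Each state maps a user intent to (required agent action, next state).
SOP_GRAPH = {
    "start": {"request_refund": ("verify_order", "verified")},
    "verified": {
        "confirm_details": ("issue_refund", "done"),
        "cancel": ("close_ticket", "done"),
    },
}


def follows_sop(graph, turns, start="start"):
    """Return True if every (intent, action) turn matches the graph's required action."""
    state = start
    for intent, action in turns:
        edge = graph.get(state, {}).get(intent)
        if edge is None or edge[0] != action:
            return False  # unknown intent in this state, or wrong action taken
        state = edge[1]
    return True


# The agent verifies the order before refunding: compliant.
good_trace = [("request_refund", "verify_order"), ("confirm_details", "issue_refund")]
# The agent refunds immediately and skips verification: non-compliant.
bad_trace = [("request_refund", "issue_refund")]

print(follows_sop(SOP_GRAPH, good_trace))
print(follows_sop(SOP_GRAPH, bad_trace))
```

In this sketch the second trace fails even though the agent plainly understood that the user wanted a refund, which is the kind of mismatch a graph-guided evaluation is meant to surface.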
Two key phenomena
Execution Gap
Models correctly classify the user’s intent (they know what the user wants), but don’t perform the correct next actions according to the SOP. Understanding does not equal execution.
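One simple way to picture the gap is as the difference between intent accuracy and action accuracy over the same set of dialogue turns. The sketch below assumes that definition for illustration only; the paper's actual metric may differ, and the record fields (`pred_intent`, `gold_intent`, `pred_action`, `gold_action`) are hypothetical names.

```python
def execution_gap(records):
    """Return (intent accuracy, action accuracy, gap) for a list of labeled turns.

    Assumed definition for illustration: gap = intent accuracy - action accuracy.
    """
    n = len(records)
    intent_acc = sum(r["pred_intent"] == r["gold_intent"] for r in records) / n
    action_acc = sum(r["pred_action"] == r["gold_action"] for r in records) / n
    return intent_acc, action_acc, intent_acc - action_acc


# Made-up example turns: every intent is classified correctly,
# but only half of the actions match what the SOP requires.
records = [
    {"pred_intent": "refund", "gold_intent": "refund",
     "pred_action": "verify_order", "gold_action": "verify_order"},
    {"pred_intent": "refund", "gold_intent": "refund",
     "pred_action": "issue_refund", "gold_action": "verify_order"},
    {"pred_intent": "cancel", "gold_intent": "cancel",
     "pred_action": "close_ticket", "gold_action": "close_ticket"},
    {"pred_intent": "cancel", "gold_intent": "cancel",
     "pred_action": "issue_refund", "gold_action": "close_ticket"},
]

intent_acc, action_acc, gap = execution_gap(records)
print(intent_acc, action_acc, gap)
```

A benchmark that reported only the first number would rate this toy agent as perfect, while the second number shows it does the wrong thing half the time.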
Empathy Resilience
Under high adversarial pressure, models maintain a polite conversational facade while making logical errors beneath the surface. The user gets the impression that the agent is competent, when in reality it is doing the wrong things — a deceptive mode of failure.
Why it matters
For companies using AI agents in customer service, this is a warning: standard benchmarks that only measure “does the agent understand the question” miss a critical dimension, namely “does the agent do the right thing after understanding.” SAGE also introduces an adversarial taxonomy of intents and a modular extension mechanism for testing in new domains at low cost.