ArXiv SAGE: 27 LLMs tested — models understand intent but don't execute correctly

Benchmark for customer service agents

The team (Shi, Dai, Wang et al.) presents SAGE (Service Agent Graph-guided Evaluation) — a benchmark that formalizes unstructured SOPs (Standard Operating Procedures) into Dynamic Dialogue Graphs and tests how well LLMs follow them in practice.
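To make the idea of an SOP-as-graph concrete, here is a minimal sketch. It is not the paper's implementation: the `Node` class, the refund SOP, and the `follows_sop` helper are all hypothetical, and the graph is static for simplicity, whereas SAGE describes its graphs as dynamic.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One SOP step (hypothetical structure, not taken from the paper)."""
    name: str
    # Maps a classified user intent to the name of the next step the SOP requires.
    transitions: dict = field(default_factory=dict)


# Hypothetical refund SOP, invented purely for illustration.
SOP = {
    "greet": Node("greet", {"refund_request": "verify_order"}),
    "verify_order": Node(
        "verify_order",
        {"order_valid": "issue_refund", "order_invalid": "escalate"},
    ),
    "issue_refund": Node("issue_refund"),
    "escalate": Node("escalate"),
}


def expected_next(graph, current, intent):
    """Return the step the SOP prescribes, or None if the intent has no edge here."""
    return graph[current].transitions.get(intent)


def follows_sop(graph, start, turns):
    """Check a dialogue against the graph.

    turns is a list of (intent, agent_next_step) pairs. The dialogue follows
    the SOP only if every step the agent takes matches the prescribed edge.
    """
    current = start
    for intent, agent_next in turns:
        if expected_next(graph, current, intent) != agent_next:
            return False
        current = agent_next
    return True
```

In this sketch, an agent that jumps from "greet" straight to "issue_refund" fails the check even if it recognized the refund request, because the graph requires "verify_order" first.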

The authors evaluated 27 LLMs on SAGE across 6 industrial scenarios, which they present as the largest evaluation of its kind for service agents.

Two key phenomena

Execution Gap

Models correctly classify the user’s intent (they know what the user wants) but then fail to take the next action the SOP prescribes. Understanding does not equal execution.
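One simple way to quantify such a gap is to score intent classification and next-action choice separately and then compare the two. The sketch below assumes a hypothetical record format (`intent_pred`, `intent_gold`, `action_pred`, `action_gold`) and is not the metric defined in the paper.

```python
def execution_gap(records):
    """Return (intent_accuracy, action_accuracy, gap) for a list of turns.

    Each record is a dict with the hypothetical keys intent_pred, intent_gold,
    action_pred and action_gold. The gap is intent accuracy minus action
    accuracy, so a large positive value means the model understands the
    request more often than it acts on it correctly.
    """
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)
    intent_acc = sum(r["intent_pred"] == r["intent_gold"] for r in records) / n
    action_acc = sum(r["action_pred"] == r["action_gold"] for r in records) / n
    return intent_acc, action_acc, intent_acc - action_acc


# Made-up example data: intent is always right, the action only half the time.
sample = [
    {"intent_pred": "refund", "intent_gold": "refund",
     "action_pred": "verify_order", "action_gold": "verify_order"},
    {"intent_pred": "refund", "intent_gold": "refund",
     "action_pred": "issue_refund", "action_gold": "verify_order"},
    {"intent_pred": "cancel", "intent_gold": "cancel",
     "action_pred": "confirm_cancel", "action_gold": "confirm_cancel"},
    {"intent_pred": "cancel", "intent_gold": "cancel",
     "action_pred": "escalate", "action_gold": "confirm_cancel"},
]
```

Calling `execution_gap(sample)` on this invented data gives an intent accuracy of 1.0 and an action accuracy of 0.5, which is the pattern the Execution Gap describes.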

Empathy Resilience

Under high adversarial pressure, models maintain a polite conversational facade while making logical errors beneath the surface. The user gets the impression that the agent is competent, when in reality it is doing the wrong things — a deceptive mode of failure.

Why it matters

For companies using AI agents in customer service, this is a warning: standard benchmarks that only measure “does the agent understand the question” miss a critical dimension — “does the agent do the right thing after understanding.” SAGE introduces an adversarial taxonomy of intents and a modular extension mechanism for testing in new domains at low cost.