SocialReasoning-Bench: AI agents leave value on the table

SocialReasoning-Bench is a new Microsoft Research benchmark measuring whether an AI agent defends the user's actual interests during negotiations with other parties — not just whether it completes the task. Results show that models close deals almost perfectly but consistently leave value on the table, with 90%+ ineffective or negligent outcomes in marketplace scenarios.

Microsoft Research has published SocialReasoning-Bench, a new benchmark evaluating whether AI agents can represent a user’s interests in negotiations with other parties. The goal is to fill a critical gap in existing evaluations: agents complete tasks, but often with suboptimal outcomes for the person they represent.

Two domains, three metrics

The benchmark tests two domains. In Calendar Coordination, an assistant schedules meetings within the user’s preferences against a requester with opposing interests. In Marketplace Negotiation, a buyer agent negotiates with a seller and must stay within a defined “reservation” price. Three things are measured: Outcome Optimality (value captured for the principal, on a 0–1 scale), Due Diligence (process quality relative to a reasonable agent policy), and Duty of Care (which requires both at once and serves as the test for trustworthy delegation).
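
To make the 0–1 Outcome Optimality scale concrete for the marketplace domain, here is a minimal sketch that treats the score as the buyer's share of the negotiable surplus. The formula, the function name `outcome_optimality`, and the seller's reservation price are assumptions made for illustration; the article does not give the benchmark's exact definition.

```python
def outcome_optimality(deal_price: float,
                       buyer_reservation: float,
                       seller_reservation: float) -> float:
    """Buyer's share of the negotiable surplus, clamped to [0, 1].

    Assumed formula (not taken from the benchmark): the surplus is the gap
    between the most the buyer would pay and the least the seller would accept.
    """
    surplus = buyer_reservation - seller_reservation
    if surplus <= 0:
        raise ValueError("no price satisfies both reservation prices")
    share = (buyer_reservation - deal_price) / surplus
    # Clamp so that deals outside the reservation range still map onto 0-1.
    return max(0.0, min(1.0, share))


# Hypothetical numbers: buyer will pay at most 100, seller accepts at least 60.
print(outcome_optimality(98, 100, 60))  # 0.05: seller keeps almost all surplus
print(outcome_optimality(70, 100, 60))  # 0.75: buyer keeps most of the surplus
```

Under this reading, a deal that closes just under the buyer's reservation price still counts as a completed task, yet scores close to zero.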

What the results show

Agents schedule meetings and close deals “in nearly all cases, but consistently achieve suboptimal terms,” the team writes. In the marketplace domain, almost all models score near zero on Outcome Optimality, meaning the counterparty captured virtually all of the surplus. Calendar outcomes are better but still fall below the midpoint of the scale, suggesting that agents defer to the requester’s preferences more often than they defend the user’s.

Better prompts are not enough

Defensive prompts help — GPT-5.4 gains +0.21 in calendar Outcome Optimality — but they do not close the gap between capable and incapable representation. Adversarial counterparties further erode results: agents rarely reject manipulative requests in calendar tasks, suggesting vulnerability to social engineering. The team classifies behavior into four archetypes: Robust, Lucky, Ineffective, Negligent. Calendar tasks show 50%+ Robust performance; marketplace shows 90%+ Ineffective or Negligent.
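
One plausible reading of the four archetypes is a two-by-two grid over Outcome Optimality and Due Diligence. The sketch below follows that reading; the quadrant mapping, the 0.5 cut-off, and the names `Archetype` and `classify` are assumptions for illustration, not the benchmark's published rules.

```python
from enum import Enum


class Archetype(Enum):
    ROBUST = "Robust"            # good outcome, good process
    LUCKY = "Lucky"              # good outcome, poor process
    INEFFECTIVE = "Ineffective"  # poor outcome, good process
    NEGLIGENT = "Negligent"      # poor outcome, poor process


def classify(outcome_optimality: float,
             due_diligence: float,
             threshold: float = 0.5) -> Archetype:
    """Place a run in one quadrant; the threshold is a hypothetical cut-off."""
    good_outcome = outcome_optimality >= threshold
    good_process = due_diligence >= threshold
    if good_outcome and good_process:
        return Archetype.ROBUST
    if good_outcome:
        return Archetype.LUCKY
    if good_process:
        return Archetype.INEFFECTIVE
    return Archetype.NEGLIGENT


# Hypothetical run: the agent followed a careful process but still
# conceded nearly all of the surplus.
print(classify(0.05, 0.9).value)  # Ineffective
```

In this sketch only the Robust quadrant would satisfy Duty of Care, since that metric requires a good outcome and a good process together.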

What this means for autonomous agents

The results raise serious questions about trustworthy delegation. Microsoft Research draws a parallel with the duties that lawyers and financial advisors owe their clients. Once agents operate in networked environments, weak negotiation skills can cascade through systems and add up to accumulated value loss.

Frequently Asked Questions

What does SocialReasoning-Bench measure differently from standard benchmarks?

Standard benchmarks measure successful task completion. SocialReasoning-Bench adds two dimensions: outcome optimality (how much value was captured for the user, on a 0–1 scale) and due diligence (process quality relative to a reasonable-agent policy). This separates luck from skill.

Which two domains are tested?

Calendar Coordination (agent schedules meetings within user preferences against an agent with opposing interests) and Marketplace Negotiation (agent negotiates price within set limits against a seller). Calendar shows 50%+ robust behavior; marketplace shows 90%+ ineffective or negligent outcomes.

Do better prompts help?

Partially. Defensive prompting helps (GPT-5.4 gains +0.21 in calendar Outcome Optimality), but it does not close the gap between capable and incapable representation. Adversarial counterparties erode results further: in calendar tasks, agents rarely refuse manipulative requests.
