IBM Research: VAKRA Benchmark Reveals AI Agents Fail on Complex Reasoning
Why it matters
IBM Research has published VAKRA, a benchmark for evaluating AI agents in enterprise environments. The key finding: models display surface-level competence on simple tasks but fail at compositional reasoning; multi-hop accuracy degrades with chain depth; and adherence to external constraints causes a significant performance drop.
IBM Research has published VAKRA, a new benchmark that puts AI agents to the test in realistic enterprise scenarios. With more than 8,000 local APIs spanning 62 domains and 4,187 test instances, VAKRA is one of the most comprehensive evaluation frameworks for testing agentic capabilities.
Where Do AI Agents Fall Short?
The key finding is the gap between surface-level competence and genuine reliability. AI agents successfully handle simple tasks requiring one or two API calls, but performance drops dramatically when a task requires compositional reasoning — the ability to combine multiple tools and steps into a coherent plan.
Multi-hop reasoning is particularly problematic: each additional step in the chain reduces accuracy, and agents often “lose the thread” after three to four steps. This is especially relevant for enterprise scenarios where business process automation requires exactly these kinds of multi-step operations.
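One way to build intuition for why long chains degrade: if an agent completes each step correctly with independent per-step accuracy p, an n-step chain succeeds only with probability roughly p^n, so end-to-end accuracy decays geometrically with depth. This is a simplified illustrative model, not VAKRA's actual methodology:

```python
# Illustrative model (not from the VAKRA benchmark itself): assuming each
# step in a chain succeeds independently with probability p, the whole
# n-step chain succeeds with probability p**n.

def chain_success_rate(per_step_accuracy: float, steps: int) -> float:
    """End-to-end success rate of a chain of independent steps."""
    return per_step_accuracy ** steps

# Even an agent that is 90% reliable per step falls below 66% by step 4,
# consistent with agents "losing the thread" after three to four hops.
for n in range(1, 6):
    print(f"{n} steps: {chain_success_rate(0.9, n):.2%}")
```

Real agents' errors are not independent, but the compounding effect is the same reason each added hop hurts disproportionately.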
Why Is Rule Adherence So Difficult?
VAKRA also tests what it calls policy adherence — an agent’s ability to respect external constraints on tool use. For example, an agent may have access to an API for deleting user data, but company policy requires prior authorization from a supervisor.
Results show that agents make significant errors in this area, often executing actions without checking constraints or completely ignoring policies. For companies considering autonomous AI agents in business processes, this is a signal that a robust governance and oversight layer is needed on top of raw agent capabilities.
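A governance layer of this kind can be as simple as gating every tool call through a policy check before execution. The sketch below uses the article's example (deleting user data requires supervisor authorization); all function and policy names are hypothetical, and VAKRA does not prescribe this design:

```python
# Hypothetical sketch of an oversight layer that gates agent tool calls.
# Tool names and policy rules are invented for illustration only.

from typing import Any, Callable

class PolicyViolation(Exception):
    """Raised when a tool call fails its policy check."""

# Policy table: tool name -> predicate over the call's keyword arguments.
# Here, deleting user data requires prior supervisor approval.
POLICIES: dict[str, Callable[[dict], bool]] = {
    "delete_user_data": lambda args: args.get("supervisor_approved") is True,
}

def governed_call(tool: Callable[..., Any], **args: Any) -> Any:
    """Execute a tool only if its policy predicate (if any) allows it."""
    rule = POLICIES.get(tool.__name__)
    if rule is not None and not rule(args):
        raise PolicyViolation(f"{tool.__name__} blocked: policy check failed")
    return tool(**args)

def delete_user_data(user_id: str, supervisor_approved: bool = False) -> str:
    # Stand-in for a real destructive API.
    return f"deleted {user_id}"

# An unapproved call is blocked; an approved one goes through.
try:
    governed_call(delete_user_data, user_id="u42")
except PolicyViolation as e:
    print(e)
print(governed_call(delete_user_data, user_id="u42", supervisor_approved=True))
```

The point of the design is that the constraint lives outside the agent: even if the model ignores the policy, the wrapper enforces it before any action executes.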
This article was generated using artificial intelligence from primary sources.
Related news
Anthropic: Memory for Managed Agents in public beta — AI agents that remember context between sessions
GitHub: Cloud agent sessions now available directly from issues and project views
ArXiv SWE-chat — a dataset of real developer interactions with AI coding agents in production