🤖 24 AI
Agents · Thursday, April 16, 2026 · 2 min read

IBM Research: VAKRA Benchmark Reveals AI Agents Fail on Complex Reasoning

Why it matters

IBM Research has published VAKRA, a new benchmark for evaluating AI agents in enterprise environments, comprising more than 8,000 local APIs, 62 domains, and 4,187 test instances. Its key findings: models display surface-level competence on simple tasks but fail on compositional reasoning; multi-hop reasoning degrades with depth; and adherence to external constraints causes a significant performance drop.

VAKRA puts AI agents to the test in realistic enterprise scenarios. With more than 8,000 local APIs, 62 domains, and 4,187 test instances, it is one of the most comprehensive evaluation frameworks for testing agentic capabilities.

Where Do AI Agents Fall Short?

The key finding is the gap between surface-level competence and genuine reliability. AI agents successfully handle simple tasks requiring one or two API calls, but performance drops dramatically when a task requires compositional reasoning — the ability to combine multiple tools and steps into a coherent plan.

Multi-hop reasoning is particularly problematic: each additional step in the chain reduces accuracy, and agents often “lose the thread” after three to four steps. This is especially relevant for enterprise scenarios where business process automation requires exactly these kinds of multi-step operations.
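A simple compounding model gives some intuition for why accuracy falls as chains get longer. The sketch below is not taken from VAKRA: it assumes every step succeeds independently with the same probability, and the 0.9 per-step rate is an illustrative number, not a reported result.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent, equally reliable steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# Illustrative per-step rate (an assumption, not a VAKRA measurement).
PER_STEP = 0.9

for steps in range(1, 6):
    print(f"{steps} step(s): {chain_success(PER_STEP, steps):.3f}")
```

Under these assumptions, a four-step chain succeeds only about 66% of the time even though each individual step succeeds 90% of the time. Real agents need not behave this way, since errors can be correlated or recoverable.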

Why Is Rule Adherence So Difficult?

VAKRA also tests what it calls policy adherence — an agent’s ability to respect external constraints on tool use. For example, an agent may have access to an API for deleting user data, but company policy requires prior authorization from a supervisor.

Results show that agents make significant errors in this area, often executing actions without checking constraints or completely ignoring policies. For companies considering autonomous AI agents in business processes, this is a signal that a robust governance and oversight layer is needed on top of raw agent capabilities.
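One possible shape for such an oversight layer is a gate that sits between the agent and its tools and refuses restricted calls until the required approval is on record. The following is a minimal sketch under stated assumptions, not VAKRA's harness or any vendor's API: `PolicyGate`, `PolicyViolation`, and `delete_user_data` are hypothetical names, and approval is tracked per tool rather than per call to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Set, Tuple


class PolicyViolation(Exception):
    """Raised when a restricted tool is called without the required approval."""


@dataclass
class PolicyGate:
    # Maps a tool name to the role that must approve it, e.g. "supervisor".
    required_approvals: Dict[str, str]
    # Approvals recorded so far, as (tool name, approver role) pairs.
    granted: Set[Tuple[str, str]] = field(default_factory=set)

    def approve(self, tool: str, approver_role: str) -> None:
        """Record that the given role has authorized the given tool."""
        self.granted.add((tool, approver_role))

    def call(self, tool: str, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        """Run fn only if the tool is unrestricted or already approved."""
        needed = self.required_approvals.get(tool)
        if needed is not None and (tool, needed) not in self.granted:
            raise PolicyViolation(f"{tool} requires prior approval from {needed}")
        return fn(*args, **kwargs)


def delete_user_data(user_id: str) -> str:
    # Stand-in for a real deletion API; it only reports what it would do.
    return f"deleted data for {user_id}"


gate = PolicyGate(required_approvals={"delete_user_data": "supervisor"})

try:
    gate.call("delete_user_data", delete_user_data, "u-123")
except PolicyViolation as exc:
    print("blocked:", exc)

gate.approve("delete_user_data", "supervisor")
print(gate.call("delete_user_data", delete_user_data, "u-123"))
```

Because the check lives outside the model, the constraint is enforced even when the agent skips or ignores it.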


This article was generated using artificial intelligence from primary sources.