IBM Research: VAKRA Benchmark Reveals AI Agents Fail on Complex Reasoning
Why it matters
IBM Research has published VAKRA, a benchmark for evaluating AI agents in enterprise environments. The key finding: models display surface-level competence on simple tasks but fail at compositional reasoning; multi-hop accuracy degrades with chain depth; and adherence to external constraints causes a significant performance drop.
IBM Research has published VAKRA, a new benchmark that puts AI agents to the test in realistic enterprise scenarios. With more than 8,000 local APIs spanning 62 domains and 4,187 test instances, VAKRA is one of the most comprehensive evaluation frameworks for testing agentic capabilities.
Where Do AI Agents Fall Short?
The key finding is the gap between surface-level competence and genuine reliability. AI agents successfully handle simple tasks requiring one or two API calls, but performance drops dramatically when a task requires compositional reasoning — the ability to combine multiple tools and steps into a coherent plan.
Multi-hop reasoning is particularly problematic: each additional step in the chain reduces accuracy, and agents often “lose the thread” after three to four steps. This is especially relevant for enterprise scenarios where business process automation requires exactly these kinds of multi-step operations.
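One way to build intuition for why long chains degrade: if an agent completes each step correctly with independent per-step accuracy p, an n-step chain succeeds only with probability roughly p^n, so end-to-end accuracy decays geometrically with depth. This is a simplified illustrative model, not VAKRA's actual methodology:

```python
# Illustrative model (not from the VAKRA benchmark itself): assuming each
# step in a chain succeeds independently with probability p, the whole
# n-step chain succeeds with probability p**n.

def chain_success_rate(per_step_accuracy: float, steps: int) -> float:
    """End-to-end success rate of a chain of independent steps."""
    return per_step_accuracy ** steps

# Even an agent that is 90% reliable per step falls below 66% by step 4,
# consistent with agents "losing the thread" after three to four hops.
for n in range(1, 6):
    print(f"{n} steps: {chain_success_rate(0.9, n):.2%}")
```

Real agents' errors are not independent, but the compounding effect is the same reason each added hop hurts disproportionately.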
Why Is Rule Adherence So Difficult?
VAKRA also tests what it calls policy adherence — an agent’s ability to respect external constraints on tool use. For example, an agent may have access to an API for deleting user data, but company policy requires prior authorization from a supervisor.
Results show that agents make significant errors in this area, often executing actions without checking constraints or completely ignoring policies. For companies considering autonomous AI agents in business processes, this is a signal that a robust governance and oversight layer is needed on top of raw agent capabilities.
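A governance layer of this kind can be as simple as gating every tool call through a policy check before execution. The sketch below uses the article's example (deleting user data requires supervisor authorization); all function and policy names are hypothetical, and VAKRA does not prescribe this design:

```python
# Hypothetical sketch of an oversight layer that gates agent tool calls.
# Tool names and policy rules are invented for illustration only.

from typing import Any, Callable

class PolicyViolation(Exception):
    """Raised when a tool call fails its policy check."""

# Policy table: tool name -> predicate over the call's keyword arguments.
# Here, deleting user data requires prior supervisor approval.
POLICIES: dict[str, Callable[[dict], bool]] = {
    "delete_user_data": lambda args: args.get("supervisor_approved") is True,
}

def governed_call(tool: Callable[..., Any], **args: Any) -> Any:
    """Execute a tool only if its policy predicate (if any) allows it."""
    rule = POLICIES.get(tool.__name__)
    if rule is not None and not rule(args):
        raise PolicyViolation(f"{tool.__name__} blocked: policy check failed")
    return tool(**args)

def delete_user_data(user_id: str, supervisor_approved: bool = False) -> str:
    # Stand-in for a real destructive API.
    return f"deleted {user_id}"

# An unapproved call is blocked; an approved one goes through.
try:
    governed_call(delete_user_data, user_id="u42")
except PolicyViolation as e:
    print(e)
print(governed_call(delete_user_data, user_id="u42", supervisor_approved=True))
```

The point of the design is that the constraint lives outside the agent: even if the model ignores the policy, the wrapper enforces it before any action executes.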
This article was generated using artificial intelligence from primary sources.
Related news
Anthropic: Memory for Managed Agents in public beta — AI agents that remember context between sessions
GitHub: Cloud agent sessions now available directly from issues and project views
ArXiv SWE-chat — a dataset of real developer interactions with AI coding agents in production