🟡 🤝 Agents · Friday, May 8, 2026 · 2 min read

CNCF: Three data retrieval strategies for AI agents fixing Kubernetes bugs

Editorial illustration: Three data retrieval strategies for AI agents fixing Kubernetes bugs

A CNCF benchmark compares RAG, hybrid and pure local retrieval for AI agents fixing real Kubernetes bugs. RAG is fastest (1m16s), but the key bottleneck is not speed or cost — it is 'scope discovery', the agent's ability to recognise all affected code across multiple files.

🤖

This article was generated using artificial intelligence from primary sources.

What did CNCF test?

On 8 May 2026 the Cloud Native Computing Foundation published a benchmark in which an AI agent, running on the Claude Opus model with a five-minute timeout, fixed nine real Kubernetes bugs of varying complexity drawn from open pull requests. The goal was not to prove the superiority of one model but to compare three data retrieval architectures that dominate agentic systems today.

Which three strategies were compared?

Three approaches were tested under equal conditions:

  • RAG only — semantic search through a KAITO/Qdrant index of the repository.
  • Hybrid — RAG combined with direct access to the local filesystem.
  • Local only — the agent uses only grep, find and other tools over the cloned repo.

RAG was fastest at an average of 1 minute 16 seconds, while the hybrid and local approaches required around 2 minutes 25 seconds.
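
To make the three modes concrete, here is a minimal sketch of what each could look like inside an agent loop. Everything in it is illustrative rather than taken from the CNCF setup: the Qdrant URL, collection name, payload key, repo path and the `embed_query` helper are assumptions.

```python
"""Illustrative sketch of the three retrieval modes; names are assumptions."""
import subprocess
from typing import List


def embed_query(text: str) -> List[float]:
    """Placeholder: a real agent would call an embedding model here."""
    raise NotImplementedError


def rag_retrieve(query: str, top_k: int = 5) -> List[str]:
    """RAG only: semantic search over a pre-built index of repository chunks
    (the benchmark used a KAITO/Qdrant index; client details here are assumed)."""
    from qdrant_client import QdrantClient

    client = QdrantClient(url="http://localhost:6333")
    hits = client.search(
        collection_name="kubernetes-repo",      # hypothetical collection name
        query_vector=embed_query(query),
        limit=top_k,
    )
    return [hit.payload["text"] for hit in hits]  # "text" payload key is assumed


def local_retrieve(query: str, repo_path: str = "./kubernetes") -> List[str]:
    """Local only: plain grep over the cloned repository, no index at all."""
    result = subprocess.run(
        ["grep", "-rn", "--include=*.go", query, repo_path],
        capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def hybrid_retrieve(query: str) -> List[str]:
    """Hybrid: semantic hits first, then direct filesystem search to widen scope."""
    return rag_retrieve(query) + local_retrieve(query)
```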

What is the real bottleneck?

The key finding of the study is that speed and token volume are not the decisive factors. The hybrid approach averaged 8 model calls and 264k tokens in total, while RAG and local-only both converged around 187–189k tokens. The number of model calls proved to be a more important cost driver than token volume.
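
A back-of-the-envelope model makes the "calls vs. tokens" point concrete. The rates below are hypothetical (the benchmark publishes neither pricing nor per-call overhead), and the second run is an invented comparison point; only hybrid's 8 calls and 264k tokens come from the article. When token totals are similar, the fixed cost attached to each call (prompt re-processing, tool round-trips, latency) becomes the dominant term.

```python
# Hypothetical rates, chosen only to illustrate the shape of the cost.
PER_TOKEN_COST = 0.000015     # assumed $ per token
PER_CALL_OVERHEAD_S = 8.0     # assumed fixed seconds of overhead per model call


def run_cost(calls: int, total_tokens: int) -> tuple[float, float]:
    """Return (token charges in $, per-call overhead in seconds)."""
    return total_tokens * PER_TOKEN_COST, calls * PER_CALL_OVERHEAD_S


runs = [
    ("hybrid (reported: 8 calls, 264k tokens)", 8, 264_000),
    ("hypothetical: same tokens, half the calls", 4, 264_000),
]
for name, calls, tokens in runs:
    dollars, overhead = run_cost(calls, tokens)
    print(f"{name}: ~${dollars:.2f} in token charges, ~{overhead:.0f}s of call overhead")
```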

The real bottleneck is what CNCF calls “scope discovery”: the agent’s ability to identify all files that need changing. Agents routinely fixed the primary bug location but missed adjacent changes at integration points. In one case the agent “swallowed errors locally instead of propagating them to the caller — functionally similar, but architecturally wrong”.
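
A minimal sketch of what a scope-discovery pass could look like for a local-only or hybrid agent: after patching the primary location, sweep the repository for every other reference to the changed symbol so integration points are not silently skipped. The symbol `validatePodSpec` and the repo path are hypothetical, not from the benchmark.

```python
# Sketch of a "scope discovery" sweep; symbol and path are hypothetical.
import subprocess


def find_affected_files(symbol: str, repo_path: str = "./kubernetes") -> set[str]:
    """Return every file that references `symbol`, as candidates for review."""
    result = subprocess.run(
        ["grep", "-rl", "--include=*.go", symbol, repo_path],
        capture_output=True, text=True,
    )
    return set(result.stdout.splitlines())


# If the fix changes the signature of a (hypothetical) helper validatePodSpec,
# every caller needs to be checked, not just the file where the bug was reported.
for path in sorted(find_affected_files("validatePodSpec")):
    print("needs review:", path)
```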

What does this mean for engineers?

The result is uncomfortable for a community investing in ever more sophisticated retrieval pipelines: when the bug description is precise (exact files, functions), the differences between strategies practically disappear. They become large only for poorly described bugs. The conclusion: the quality of the issue description matters more than the retrieval strategy, and agents still lack systematic architectural reasoning no matter how context is supplied to them.

Frequently Asked Questions

What is RAG in the context of AI agents?
Retrieval-Augmented Generation — the agent first fetches relevant code sections from a vector database (KAITO/Qdrant), then uses them as context for generating a solution.
What does 'scope discovery' mean?
The agent's ability to identify all files and code locations that need to be changed for a complete bug fix, not just the primary error location.
Why is the number of model calls more important than token count?
The hybrid approach averaged 8 model calls, which made it the most expensive even though total token usage was broadly similar across strategies (roughly 187k–264k); each call carries fixed costs in addition to per-token charges.