CNCF: a new observability framework for AI agents

The CNCF blog explains why the classical approach to system monitoring does not work for AI agents and LLMs, which operate probabilistically: the same query can produce entirely different results, and errors are often semantic rather than technical.

What is observability and why do AI systems break the old rules?

Observability is the ability to understand the internal state of a system from external signals: logs, metrics, and traces. In classical software systems the same input is expected to produce the same output, so anomalies tend to show up as elevated latency or error rates. CNCF highlights that this assumption does not hold for LLMs and AI agents, which operate probabilistically.

Why probabilistic behavior changes everything

The same prompt can produce entirely different responses depending on the sampling temperature and the surrounding context. Errors are not always technical: an agent can respond without throwing an exception and still make the wrong decision. Classical telemetry cannot see this. Tools such as Prometheus and Grafana collect and visualize CPU, memory, and HTTP status metrics, but they miss the semantic layer: did the agent understand the task, did it take the right step, was the result useful?
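
To make the role of temperature concrete, here is a minimal Python sketch of temperature-scaled sampling. It is not taken from the CNCF post: the action names, the scores in LOGITS, and both function names are hypothetical, chosen only to illustrate why one fixed prompt can lead to different agent decisions.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(actions, logits, temperature, rng):
    """Draw one action according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(actions, weights=probs, k=1)[0]

# Hypothetical candidate next steps and scores for one fixed prompt.
ACTIONS = ["refund", "escalate", "ignore"]
LOGITS = [2.0, 1.5, 0.5]

if __name__ == "__main__":
    rng = random.Random()  # unseeded on purpose: repeated runs will differ
    for temperature in (0.2, 1.0):
        picks = [sample_action(ACTIONS, LOGITS, temperature, rng) for _ in range(10)]
        print(temperature, picks)
```

At a low temperature the highest-scoring action dominates, while at a higher temperature the alternatives are picked noticeably more often. Every one of these draws would look like a successful request to infrastructure monitoring.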

A framework for sustainable agentic observability

CNCF proposes shifting focus from infrastructure metrics to outcome reliability. Instead of ‘is the service available?’, the question becomes ‘was the decision correct?’. Concretely, this means tracking semantic prompt/response patterns, decision quality, and consistency of agent behavior over time. The approach offers a framework for ‘sustainable’ observability that does not generate data noise but instead measures what truly affects results.
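
As a rough illustration of the difference between infrastructure metrics and outcome reliability, the following Python sketch computes both over a handful of agent runs. It is an assumption-laden example, not something the CNCF post prescribes: the AgentRun record, its fields, and the three metric functions are hypothetical, and the expected decision is assumed to come from a labeled evaluation set or human review.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class AgentRun:
    prompt_id: str           # hypothetical identifier of the prompt being replayed
    decision: str            # what the agent actually decided
    expected: str            # assumed ground truth, e.g. from human review
    http_status: int = 200   # the signal classical telemetry would see

def availability(runs):
    """Infrastructure view: share of runs without a server-side error."""
    if not runs:
        return 0.0
    return sum(r.http_status < 500 for r in runs) / len(runs)

def decision_accuracy(runs):
    """Outcome view: share of runs whose decision matches the expected one."""
    if not runs:
        return 0.0
    return sum(r.decision == r.expected for r in runs) / len(runs)

def consistency(runs):
    """Per-prompt share of runs agreeing with the most common decision, averaged over prompts."""
    by_prompt = {}
    for r in runs:
        by_prompt.setdefault(r.prompt_id, []).append(r.decision)
    if not by_prompt:
        return 0.0
    scores = []
    for decisions in by_prompt.values():
        top_count = Counter(decisions).most_common(1)[0][1]
        scores.append(top_count / len(decisions))
    return sum(scores) / len(scores)

# Illustrative data only: every run returns HTTP 200, yet half the decisions are wrong.
RUNS = [
    AgentRun("p1", "refund", "refund"),
    AgentRun("p1", "refund", "refund"),
    AgentRun("p1", "escalate", "refund"),
    AgentRun("p2", "ignore", "escalate"),
]

if __name__ == "__main__":
    print("availability:", availability(RUNS))
    print("decision accuracy:", decision_accuracy(RUNS))
    print("consistency:", round(consistency(RUNS), 3))
```

On this sample, availability is perfect while decision accuracy is only one half, which is the gap the framework is concerned with. Tracking the same numbers over time would give a simple view of behavioral drift.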

Unlike deterministic microservices, where conventional infrastructure tooling is generally enough, agentic systems require an additional monitoring layer, one specific to LLM interactions and autonomous decisions.

Frequently Asked Questions

Why is classical observability insufficient for AI agents?

AI systems operate probabilistically — the same prompt can produce different results, so infrastructure metrics (latency, CPU) do not reveal semantic errors in an agent's decisions.

What does CNCF recommend monitoring in agentic systems?

CNCF proposes tracking semantic patterns of prompt/response pairs and decision quality, rather than just technical indicators such as error counts or service availability.
