HolmesGPT and CNCF tools auto-diagnose Kubernetes alerts for $0.04
Why it matters
The STCLab SRE team uses HolmesGPT with the ReAct pattern and CNCF tools for automatic diagnosis of Kubernetes alerts. The cost is $0.04 per investigation, around 40% of alerts are resolved autonomously, and the most important lesson: quality runbooks matter more than model choice.
STCLab, whose two-person SRE team manages multiple Amazon EKS clusters serving traffic in 200 countries, published a detailed breakdown of their production integration of HolmesGPT with CNCF tools for automatic diagnosis of Kubernetes alerts.
Architecture and workflow
The core of the system is HolmesGPT’s ReAct pattern, which lets the language model independently select investigative tools based on alert context. Prometheus alerts pass through Robusta OSS, which enriches them with metadata before sending them to Slack. (Kubernetes is an open-source container orchestration platform, Prometheus is the de facto standard for metrics and alerting, and CNCF is the Cloud Native Computing Foundation under the Linux Foundation.) For each alert, HolmesGPT starts an investigation using tools such as Inspektor Gadget and KubeAI, and posts the results back to the same Slack thread where the alert appeared. A custom 200-line Python script links Slack threads, deduplicates alerts, and routes events to the appropriate runbooks.
Numbers that justify the cost
The cost of a single investigation is approximately $0.04 USD, which comes to around $12 per month for the entire system. Deduplication reduces daily raw alerts from 40 to approximately 12 unique investigations. Engineers complete analysis in under two minutes, compared to the previous 15 to 20 minutes. Approximately 40% of investigations are resolved autonomously, without human intervention. This cost-to-efficiency ratio makes the investment trivial compared to the cost of SRE engineer time.
Lesson: runbooks matter more than models
The authors particularly emphasize that the quality of structured runbooks determines the success of an investigation more than the choice of LLM. A controlled test with the same model scored 4.6 out of 5 when runbooks existed, versus only 3.6 without them, on the same alerts. The team maintains seven namespace-specific runbooks, each with metadata listing the available tools. They use a hybrid deployment: self-hosted HolmesGPT for staging and a managed API for production. The entire stack relies exclusively on CNCF-ecosystem projects: HolmesGPT, Kubernetes, Prometheus, Robusta OSS, Inspektor Gadget, and KubeAI.
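The article does not show the runbook format, but "metadata listing available tools" suggests a per-runbook allow-list the agent can consult. A minimal sketch, with entirely hypothetical field and tool names:

```python
# Hypothetical runbook record; the schema is illustrative, not the team's.
RUNBOOK = {
    "namespace": "payments",
    "alerts": ["KubePodCrashLooping", "OOMKilled"],
    "tools": ["kubectl_logs", "prometheus_query", "inspektor_gadget_trace"],
    "steps": [
        "Check recent pod restarts and container exit codes.",
        "Compare memory usage against container limits.",
        "If limits were hit, recommend raising the limit or fixing the leak.",
    ],
}


def tools_for(alertname: str, runbooks: list[dict]) -> list[str]:
    """Return the tool allow-list of the runbook matching an alert name."""
    for rb in runbooks:
        if alertname in rb["alerts"]:
            return rb["tools"]
    return []  # no runbook: the agent falls back to its default tool set
```

Scoping tools per runbook keeps the ReAct loop focused: the model picks only from actions that make sense for that namespace, which is one plausible reason runbook quality outweighed model choice in the team's test.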
This article was generated using artificial intelligence from primary sources.