What is Kepler and what is it used for?

Kepler is a CNCF sandbox project that measures energy consumption in Kubernetes clusters, attributes it to individual Linux processes and Pods, and exports the results as Prometheus metrics. It is particularly relevant for tracking the energy use of AI and ML workloads.

What were the main problems with the old architecture and how does the new one solve them?

The old architecture used eBPF, which required the CAP_BPF and CAP_SYS_ADMIN capabilities, missed short-lived processes, and produced multi-kilowatt measurement spikes. The new one reads standard /proc and /sys paths, operates read-only, dynamically discovers the power meter structure, and requires no kernel privileges.

What are the experimental results of the new architecture?

The new kepler_node_cpu_watts tracks IPMI ground truth without kW spikes, and the process power attribution gap has been reduced to the milliwatt level. Test coverage reached 90%.

CNCF Kepler: Precise Pod Energy Measurement Without eBPF

CNCF sandbox project Kepler has been completely rewritten: the new architecture replaces the eBPF approach by reading standard /proc and /sys paths, eliminates multi-kilowatt measurement spikes, and reduces the process power attribution gap to the milliwatt level.

Kepler (Kubernetes-based Efficient Power Level Exporter), a CNCF sandbox project that has been measuring energy consumption in Kubernetes clusters since 2023, has received a completely new architecture. The team published a detailed technical rationale for the rewrite, experimental results, and a community call to action — particularly relevant for organizations running AI and ML workloads whose energy consumption is becoming an increasingly important metric.

Why Is Kubernetes Blind to Energy Consumption?

Kubernetes natively provides no mechanism for tracking how much energy a given Pod or workload consumes. Cluster administrators can see CPU and memory usage, but not watts — which becomes a problem when organizations want to track their carbon footprint, optimize energy costs, or fulfill ESG reporting requirements. Kepler fills that gap: it reads hardware power meters, attributes consumption to individual Linux processes and Pods, and exports the results as Prometheus metrics.

The Old Architecture and Its Problems

The original Kepler relied on eBPF (extended Berkeley Packet Filter) to capture utilization signals. This approach had several serious limitations in production environments:

It required the CAP_BPF and CAP_SYS_ADMIN capabilities, which many security teams do not permit for standard monitoring tools. Production Kubernetes clusters often enforce strict container privilege policies, so Kepler was frequently blocked at the deployment stage.

The eBPF approach missed short-lived processes that had finished before the kernel probe could record them. In AI/ML workloads that heavily use short batch tasks, this imprecision could accumulate.

The most visible symptom was the appearance of multi-kilowatt spikes in measurements — implementation artifacts that did not reflect actual physical consumption, but contaminated metrics and dashboards.

New Architecture Without Kernel Privileges

The new Kepler has completely abandoned eBPF. Instead, it reads standard /proc and /sys paths that the Linux kernel exposes for all processes — without any kernel privileges. The approach is read-only: Kepler does not write to the kernel or inject code.
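
To make the read-only idea concrete, here is a minimal Python sketch of how a per-process CPU time sample can be taken from /proc/<pid>/stat. It is an illustration only, not Kepler's implementation: the function names parse_cpu_ticks and read_cpu_seconds are hypothetical, and the source does not say which /proc files Kepler actually reads.

```python
import os


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The second field (comm) is wrapped in parentheses and may itself
    # contain spaces or ')', so split on the LAST closing parenthesis.
    _, _, after_comm = stat_line.rpartition(")")
    fields = after_comm.split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(fields[11]), int(fields[12])
    return utime + stime


def read_cpu_seconds(pid: int, proc_root: str = "/proc") -> float:
    """Read cumulative CPU seconds for a process. Opens the file read-only."""
    with open(os.path.join(proc_root, str(pid), "stat"), "r") as f:
        ticks = parse_cpu_ticks(f.read())
    # Convert kernel clock ticks to seconds.
    return ticks / os.sysconf("SC_CLK_TCK")
```

Sampling this value twice and taking the difference gives the CPU time a process used during the interval, which is the kind of utilization signal a power attribution step can work from.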

The key innovation is runtime dynamic discovery of the power meter structure. The old architecture assumed a fixed hardware topology, which caused errors on servers with different configurations of DRAM, socket, and package levels. The new version reads the structure from /sys at startup and adapts without manual configuration.
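
Runtime discovery of this kind can be sketched in a few lines of Python. The example assumes the meters are exposed through the Linux powercap interface under /sys/class/powercap (the usual location for RAPL zones on Intel and AMD systems); whether Kepler uses exactly these files is an assumption, and the names Zone, discover_zones, and read_energy_uj are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass
class Zone:
    path: str           # directory of the zone under the powercap root
    name: str           # e.g. "package-0" or "dram", as reported by the kernel
    max_energy_uj: int  # counter wrap point in microjoules, 0 if not exposed


def _read_text(path: str) -> str:
    with open(path, "r") as f:
        return f.read().strip()


def discover_zones(root: str = "/sys/class/powercap") -> list:
    """Enumerate every entry that looks like an energy zone, with no fixed topology."""
    zones = []
    for entry in sorted(os.listdir(root)):
        zone_dir = os.path.join(root, entry)
        name_file = os.path.join(zone_dir, "name")
        energy_file = os.path.join(zone_dir, "energy_uj")
        # Control-type directories have no name/energy_uj files: skip them.
        if not (os.path.isfile(name_file) and os.path.isfile(energy_file)):
            continue
        max_file = os.path.join(zone_dir, "max_energy_range_uj")
        max_uj = int(_read_text(max_file)) if os.path.isfile(max_file) else 0
        zones.append(Zone(zone_dir, _read_text(name_file), max_uj))
    return zones


def read_energy_uj(zone: Zone) -> int:
    """Read the zone's cumulative energy counter in microjoules (read-only)."""
    return int(_read_text(os.path.join(zone.path, "energy_uj")))
```

Because the zone list is built from whatever the kernel exposes, a server with two packages and DRAM zones and a server with a single package are handled by the same code path.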

Deployment has been simplified to a single Helm chart, which has significantly reduced the learning curve and the number of configurable parameters.

Experimental Results

The team ran two key experiments to validate the new architecture:

Experiment 1 compared the new kepler_node_cpu_watts against IPMI ground truth measurements (physical sensors on server hardware). Result: the new metric tracks the IPMI pattern without the multi-kilowatt spikes that characterized the old implementation.
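
For context, a watts value is typically derived from a cumulative energy counter by dividing the energy delta by the sampling interval. The Python sketch below shows that arithmetic, including the case where the counter wraps around between two samples. It is a generic illustration under stated assumptions: the function name watts_from_counter is hypothetical, and the source does not describe what caused the spikes in the old implementation or how Kepler computes kepler_node_cpu_watts.

```python
def watts_from_counter(prev_uj: int, curr_uj: int, max_uj: int, interval_s: float) -> float:
    """Average power in watts between two samples of a cumulative microjoule counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_uj = curr_uj - prev_uj
    if delta_uj < 0:
        # The counter wrapped past max_uj and restarted near zero.
        delta_uj = (max_uj - prev_uj) + curr_uj
    # microjoules -> joules, then joules per second = watts
    return delta_uj / 1e6 / interval_s
```

For example, a counter that moves from 1,000,000 uJ to 6,000,000 uJ over 5 seconds corresponds to 5 J / 5 s = 1 W.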

Experiment 2 measured the power attribution gap at the process level — the difference between the sum of attributed power for all processes and the total measured node power. The gap was reduced to the milliwatt level (rather than watts or kilowatts as before), confirming that the new architecture consistently distributes total consumption.
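
The gap itself is simple arithmetic, which the following Python sketch illustrates. It assumes a plain proportional model in which node power is split according to each process's share of CPU time; the source does not specify Kepler's attribution model, and the names attribute_power and attribution_gap are hypothetical.

```python
def attribute_power(node_watts: float, cpu_seconds_by_proc: dict) -> dict:
    """Split node power across processes in proportion to their CPU time."""
    total_cpu = sum(cpu_seconds_by_proc.values())
    if total_cpu <= 0:
        # No measurable activity in this interval: attribute nothing.
        return {name: 0.0 for name in cpu_seconds_by_proc}
    return {
        name: node_watts * cpu / total_cpu
        for name, cpu in cpu_seconds_by_proc.items()
    }


def attribution_gap(node_watts: float, attributed: dict) -> float:
    """Node power minus the sum of per-process power; ideally close to zero."""
    return node_watts - sum(attributed.values())
```

With a node at 100 W and two processes that used 3 s and 1 s of CPU time, the sketch attributes 75 W and 25 W, leaving a gap of zero. When no process shows any CPU time, the whole node reading stays unattributed, which is one reason idle power needs separate treatment.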

Test coverage reached 90%, which is a high level for an infrastructure tool of this type.

Community Call to Action

The Kepler team identifies four areas where it is seeking contributions:

GPU monitoring for AI/ML workloads remains an unsolved problem: the current architecture covers CPU, but per-Pod GPU power attribution is more complex because of how NVIDIA and AMD expose metrics. This is especially relevant for organizations running LLM inference or training in Kubernetes.

Power modeling for VM environments requires an ML approach because the virtualization layer hides physical meters. The team is looking for experts who can train power estimation models.

Validation against physical meters (IPMI, external wattmeters) and improved idle power attribution are two additional open problems.

For organizations that are already measuring the energy consumption of AI infrastructure, the new version of Kepler represents a more stable foundation for integration into existing Prometheus/Grafana stacks without compromising cluster security policies.
