NADIKI: Why we chose the Observer architecture over a Kubernetes plugin

In our initial concept for NADIKI, we intended to develop a Kubernetes extension for measuring environmental impacts. However, existing approaches such as Scaphandre and Kepler have fundamental limitations: they measure only CPU energy consumption via Intel RAPL, fall back on estimates in virtualized environments, and require cloud providers to grant access to hardware interfaces. We therefore opted for an Observer Architecture that reflects the actual relationship between infrastructure providers and their customers, delivering more precise data without breaching security boundaries.

In Work Package 1 of the NADIKI project, we originally intended to develop an open-source Kubernetes extension that measures the resource consumption of AI applications at runtime. After a thorough analysis of existing solutions, we abandoned this approach and instead designed an Observer Architecture. This article describes the reasons for that decision.

Limitations of Existing Kubernetes Energy Measurement

Two open-source projects dominate energy measurement in Kubernetes environments: Scaphandre and Kepler. Both encounter the same structural issues in practice.

Scaphandre measures energy consumption through Intel RAPL (Running Average Power Limit) and provides the data in Kubernetes. Three limitations make this approach unsuitable for our use case:

  • CPU Consumption Only: RAPL captures only the energy consumption of the CPU package. GPUs, network cards, storage media, and other peripherals remain invisible. For AI workloads, which rely heavily on GPU acceleration, GPU consumption in particular must be included.

  • Dependence on Cloud Provider: Access to RAPL interfaces requires the infrastructure provider to allow and support this in the virtualized environment. In practice, this is rarely the case.

  • Performance Overhead: With high-frequency measurements (e.g., every 5–15 seconds), Scaphandre (and RAPL itself) imposes significant overhead on the measured system, which is especially problematic under high load.
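To illustrate the measurement model these tools build on: RAPL exposes a monotonically increasing energy counter for the CPU package, and power is derived by differencing two samples. The following sketch shows this calculation; the sysfs path in the docstring and the wraparound constant are illustrative defaults, and real values vary by CPU model.

```python
def rapl_average_power_w(energy_uj_start, energy_uj_end, interval_s,
                         max_energy_range_uj=262_143_328_850):
    """Average package power (watts) from two RAPL energy counter samples.

    On Linux the counter is typically read from a sysfs file such as
    /sys/class/powercap/intel-rapl:0/energy_uj (illustrative path). It
    counts microjoules for the CPU package only -- GPUs, NICs, and disks
    are not covered -- and wraps around at max_energy_range_uj.
    """
    delta_uj = energy_uj_end - energy_uj_start
    if delta_uj < 0:  # counter wrapped between the two samples
        delta_uj += max_energy_range_uj
    return delta_uj / 1e6 / interval_s

# Two samples 5 seconds apart, 225 J consumed -> 45 W package power
print(rapl_average_power_w(1_000_000_000, 1_225_000_000, 5.0))  # 45.0
```

Note that each sample requires a sysfs read on the measured host itself, which is where the polling overhead at short intervals comes from.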

Kepler follows a similar approach and also relies heavily on RAPL. In virtualized environments, which in our assessment host the majority of productive AI workloads, Kepler falls back on estimation models instead of actual measurements.

Both tools share a fundamental architectural problem: they attempt to solve energy measurement from within the customer's Kubernetes environment. This either requires privileged access to hardware interfaces or accepts imprecise estimates.

The Reality of Cloud Infrastructure as a Starting Point

Our approach aligns with the actual relationship between infrastructure provider and customer. In practice, there is a clear separation: the provider operates the physical infrastructure, while the customer runs workloads in a virtualized environment. This separation is intentional; for security reasons, the customer has no access to the physical server's hardware interfaces.

The Observer Architecture respects this separation, rather than circumventing it:

  • On the Provider's Side: The provider installs an exporter on the physical machine. This registers the server with the NADIKI registrar according to the NADIKI API specification and exports metrics obtained directly via IPMI and NVIDIA's GPU power tools. This captures the energy consumption of the entire server and all GPUs, far more accurately than RAPL-based measurements.

  • On the Customer's Side: The standard Prometheus exporter in Kubernetes is used, without additional plugins or extensions. It transmits resource-usage information per cluster and per pod to the registrar.

  • The Registrar: Links the physical machine with the Kubernetes metrics and calculates the environmental impact per workload. The only information the provider needs to pass to the customer is the physical server ID assigned by the registrar upon registration.
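The registrar's core task, linking physical measurements with workload metrics, can be sketched as a simple attribution step. The proportional CPU-time allocation key below is an illustrative assumption, not the specified NADIKI method; a real registrar might also weight GPU time, memory, or other resources.

```python
def attribute_energy_wh(server_energy_wh, pod_cpu_seconds):
    """Split a physical server's measured energy across the pods it hosted.

    server_energy_wh: energy the provider-side exporter reported for the
                      physical machine (IPMI, whole server) over an interval.
    pod_cpu_seconds:  pod name -> CPU seconds consumed in the same interval,
                      as scraped from the standard Prometheus exporter.
    Returns each pod's energy share in watt-hours, allocated proportionally
    to CPU time (illustrative allocation key).
    """
    total = sum(pod_cpu_seconds.values())
    if total == 0:
        return {pod: 0.0 for pod in pod_cpu_seconds}
    return {pod: server_energy_wh * secs / total
            for pod, secs in pod_cpu_seconds.items()}

# A server consumed 200 Wh; one pod used 300 CPU-seconds, another 100.
usage = {"training-pod": 300.0, "inference-pod": 100.0}
print(attribute_energy_wh(200.0, usage))
# {'training-pod': 150.0, 'inference-pod': 50.0}
```

The only piece of shared state this requires is the mapping from cluster node to physical server, which is exactly the server ID the provider passes to the customer.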

Advantages Over a Kubernetes Plugin

  • No Security Layer Opened: The customer does not need access to IPMI or GPU tools. The provider does not need to pass hardware interfaces into the virtualized environment.

  • No Additional Tooling for the Customer: Standard Kubernetes monitoring is sufficient. The existing Prometheus infrastructure is used, creating no new dependencies.

  • Complete Capture: IPMI provides the total energy consumption of the server. NVIDIA SMI captures GPU consumption. The restriction to CPU consumption via RAPL is eliminated.

  • No Performance Overhead: Measurements occur on the physical level, not within the virtualized workload. The measured system is not additionally burdened.

  • Scalable Beyond the Server Level: The registrar aggregates not only server data but also rack- and data-center-level metrics: cooling, water consumption, grid power emission factors. A Kubernetes plugin could only reach these levels through complex extensions.
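On the provider side, the exporter's job reduces to parsing the output of standard tooling. The sketch below shows this for `ipmitool dcmi power reading` and `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits`; the embedded sample outputs are illustrative, and real output formats vary by BMC firmware and driver version.

```python
import re

def parse_ipmi_power(output):
    """Extract the instantaneous server power (watts) from the output of
    `ipmitool dcmi power reading` (line format may vary by BMC)."""
    m = re.search(r"Instantaneous power reading:\s+(\d+)\s+Watts", output)
    return int(m.group(1)) if m else None

def parse_nvidia_smi_power(output):
    """Sum per-GPU power draw (watts) from the output of
    `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits`,
    which prints one number per GPU."""
    return sum(float(line) for line in output.splitlines() if line.strip())

# Illustrative sample outputs in place of live tool invocations:
ipmi_sample = "    Instantaneous power reading:                   312 Watts\n"
smi_sample = "285.33\n297.10\n"
print(parse_ipmi_power(ipmi_sample))        # 312
print(parse_nvidia_smi_power(smi_sample))   # 582.43
```

Because both tools run on the physical host, none of this touches the customer's virtualized environment, and the customer never needs the privileges these commands require.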

Further Resources