arXiv:2605.21427: PALS — power-aware LLM serving for MoE models achieves +26.3% energy efficiency and 4-7× fewer QoS violations
Researchers published PALS on the arXiv preprint server on 21 May 2026. It is a runtime system that integrates GPU power control directly into LLM serving for Mixture-of-Experts models. PALS uses lightweight offline power-performance models and a feedback controller that dynamically optimises configurations against throughput targets. It achieves a 26.3% improvement in energy efficiency and a 4-7× reduction in QoS violations under power constraints, and it integrates into vLLM without modifying the API or retraining models. It addresses a growing operational pain point for data centres: GPU cluster energy consumption, which is becoming the dominant constraint on growth.
This article was generated using artificial intelligence from primary sources.
A group of researchers (see full author list on arXiv) published the preprint PALS: Power-Aware LLM Serving for Mixture-of-Experts Models (arXiv:2605.21427) on 21 May 2026, presenting a runtime system for optimising energy consumption in LLM serving infrastructure. PALS directly addresses the energy consumption of GPU data centres, a problem that in 2025-2026 has become the dominant operational constraint on AI infrastructure growth.
What does PALS concretely do?
PALS is a layer inserted between the vLLM serving framework and the GPU hardware. It operates in three steps:
Offline modelling: researchers build lightweight offline models that relate GPU power settings (applied through DVFS, Dynamic Voltage and Frequency Scaling) to inference latency and throughput for different expert configurations. The models are small (KB-sized) and do not require real-time ML inference.
Online feedback controller — at runtime, PALS monitors the current workload (number of concurrent requests, input token rate, expert utilisation patterns) and dynamically adjusts the GPU power state. The goal is to minimise energy consumption for given SLA targets (p95 latency, throughput target).
vLLM integration — everything happens through vLLM scheduler hooks. The existing vLLM API remains unchanged. Models do not need to be retrained or modified. This is a significant engineering choice because it enables drop-in deployment into existing serving stacks.
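The interplay between the offline model and the online controller can be illustrated with a minimal sketch. It is not the authors' implementation: the table `OFFLINE_MODEL`, its numbers, the `Targets` fields, and the one-level-per-step policy are all assumptions made for illustration, and the step that would actually apply a clock to the GPU is left out.

```python
from dataclasses import dataclass

# Hypothetical offline model: GPU core clock in MHz -> (predicted tokens/s, predicted watts).
# All numbers are invented for illustration and are not taken from the paper.
OFFLINE_MODEL = {
    900: (1800.0, 310.0),
    1200: (2400.0, 420.0),
    1500: (2900.0, 560.0),
    1800: (3200.0, 700.0),
}


@dataclass
class Targets:
    min_tokens_per_s: float  # throughput target
    power_cap_w: float       # power budget per GPU


def initial_clock(targets: Targets) -> int:
    """Lowest clock whose predicted throughput meets the target within the power cap."""
    for clock in sorted(OFFLINE_MODEL):
        tps, watts = OFFLINE_MODEL[clock]
        if tps >= targets.min_tokens_per_s and watts <= targets.power_cap_w:
            return clock
    # Nothing satisfies both: use the fastest clock that still respects the cap.
    under_cap = [c for c, (_, w) in OFFLINE_MODEL.items() if w <= targets.power_cap_w]
    return max(under_cap) if under_cap else min(OFFLINE_MODEL)


def feedback_step(clock: int, measured_tps: float, targets: Targets) -> int:
    """One control step: move at most one clock level based on measured throughput."""
    clocks = sorted(OFFLINE_MODEL)
    i = clocks.index(clock)
    if measured_tps < targets.min_tokens_per_s:
        # Too slow: step up if a higher level exists and fits under the power cap.
        if i + 1 < len(clocks) and OFFLINE_MODEL[clocks[i + 1]][1] <= targets.power_cap_w:
            return clocks[i + 1]
        return clock
    if i > 0:
        # Scale the measurement by the model's throughput ratio to predict the lower level.
        lower = clocks[i - 1]
        predicted = measured_tps * OFFLINE_MODEL[lower][0] / OFFLINE_MODEL[clock][0]
        if predicted >= targets.min_tokens_per_s:
            return lower
    return clock
```

In this sketch the offline table supplies the starting point and the predicted effect of a change, while the measured throughput corrects for whatever the table gets wrong. A real controller would also need to damp oscillation between neighbouring levels.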
What are the concrete performance results?
PALS shows the following results in experiments:
- 26.3% improvement in energy efficiency (measured in tokens generated per joule consumed)
- 4-7× reduction in QoS violation rate under power capping constraints
- No throughput degradation at standard power budgets
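For readers who want to reproduce metrics of this kind on their own serving logs, the following sketch shows one plausible way to compute them. The function names are hypothetical, and the definition of a QoS violation used here (a request whose latency exceeds a fixed target) is an assumption; the preprint may define it differently.

```python
def tokens_per_joule(tokens_generated, power_samples_w, sample_interval_s):
    """Energy efficiency from evenly spaced power readings (watts x seconds = joules)."""
    energy_j = sum(power_samples_w) * sample_interval_s
    if energy_j <= 0:
        raise ValueError("energy must be positive")
    return tokens_generated / energy_j


def qos_violation_rate(latencies_s, latency_target_s):
    """Fraction of requests whose latency exceeds the target (assumed definition)."""
    if not latencies_s:
        return 0.0
    violations = sum(1 for latency in latencies_s if latency > latency_target_s)
    return violations / len(latencies_s)


def relative_gain(baseline, improved):
    """Relative improvement of one efficiency figure over another."""
    return (improved - baseline) / baseline
```

For example, 12,000 tokens generated during ten one-second samples at 400 W correspond to 4,000 J of energy, or 3.0 tokens per joule.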
Energy efficiency is a particularly important metric for hyperscale operators (Meta, Google, Microsoft, AWS, Anthropic, OpenAI), where GPU energy cost makes up a significant share of LLM inference operating expenses.
Why are MoE models particularly interesting?
Mixture-of-Experts architectures (Mixtral 8x22B, DeepSeek V3 with 256 routed experts, Qwen MoE variants) have a heterogeneous computation profile: different experts are activated for different input sequences, so a single fixed power state is not optimal.
Classical LLM serving stacks treat MoE models as if they were dense — applying the same power state to the entire GPU regardless of which subset of experts is activated. PALS exploits this variability — when the model is currently running a computationally lighter path, the GPU power state is lowered without impacting latency.
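A toy model makes the idea concrete. The sketch below assumes that the cost of a decoding step grows with the number of distinct experts a batch activates and that step time scales inversely with the GPU clock. Both are simplifications chosen for illustration (memory-bound phases, for instance, do not scale this way), and the constants and function names are hypothetical rather than taken from the preprint.

```python
CLOCKS_MHZ = [900, 1200, 1500, 1800]  # hypothetical selectable clock levels
BASE_MS = 2.0         # assumed fixed cost per step at the highest clock
PER_EXPERT_MS = 0.5   # assumed extra cost per distinct expert at the highest clock


def distinct_experts(routing):
    """Count distinct experts used by a batch; routing holds the expert ids chosen per token."""
    return len({expert for token_experts in routing for expert in token_experts})


def step_time_ms(n_experts, clock_mhz):
    """Toy estimate: cost at the highest clock, stretched inversely with the chosen clock."""
    at_max_clock = BASE_MS + PER_EXPERT_MS * n_experts
    return at_max_clock * max(CLOCKS_MHZ) / clock_mhz


def lowest_clock_within_budget(n_experts, budget_ms):
    """Pick the lowest clock whose estimated step time still fits the latency budget."""
    for clock in sorted(CLOCKS_MHZ):
        if step_time_ms(n_experts, clock) <= budget_ms:
            return clock
    return max(CLOCKS_MHZ)  # budget cannot be met: run as fast as possible
```

Under these assumptions, a batch that touches only four experts fits a 10 ms budget at the lowest clock, while a batch that touches sixteen needs the highest one. A dense-style policy would run both at the same setting.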
What does this mean for AI infrastructure?
In 2026, energy efficiency is a critical scaling factor for all hyperscale operators. NVIDIA H100 and B200 GPU clusters draw power on the scale of megawatts, and access to power has become a serious constraint on building new data centres (known as the “power gap” problem).
PALS and similar optimisation techniques are becoming strategically important for the economics of the serving stack. Because the metric is tokens per joule, a 26.3% improvement means the same number of tokens can be generated with roughly 20.8% less energy (1/1.263 ≈ 0.792), or that an existing GPU cluster can serve about 26.3% more tokens within the same energy budget, assuming the reported gain carries over to the deployment's workload.
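The arithmetic that links a tokens-per-joule gain to an energy saving is easy to get wrong, so a short check may help. The function name is hypothetical, and the calculation assumes the gain applies uniformly across the workload.

```python
def energy_saving_for_same_tokens(efficiency_gain):
    """Fraction of energy saved when tokens per joule rise by efficiency_gain.

    Energy for a fixed number of tokens is tokens / (tokens per joule), so a
    gain g shrinks energy by a factor of 1 / (1 + g), not by g itself.
    """
    return 1.0 - 1.0 / (1.0 + efficiency_gain)


saving = energy_saving_for_same_tokens(0.263)
print(f"Energy saved for the same output: {saving:.1%}")  # about 20.8%
```

The reverse direction is the simpler one: at a fixed energy budget, output rises by the full efficiency gain.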
For the open-source community, integration into vLLM (one of the most widely used open-source LLM serving frameworks) means PALS could become a widely adopted power-aware serving layer. It is worth watching whether the authors publish a reference implementation or contribute directly to the vLLM mainline branch.
Frequently Asked Questions
- What does PALS specifically change in the vLLM serving stack?
- PALS adds a GPU power control layer that dynamically adjusts power states (DVFS) depending on current workload and SLA targets, integrated directly into the vLLM scheduler.
- What are the concrete performance figures of the PALS system?
- 26.3% higher energy efficiency and 4-7× fewer QoS violations under power constraints, without retraining models or changing the serving API.
- For which models is PALS designed?
- Mixture-of-Experts (MoE) models such as Mixtral, DeepSeek V3, and Qwen MoE variants, in which different experts are activated for different inputs and the computation profile therefore varies.