ArXiv SAGA: workflow-atomic GPU scheduling for AI agents achieves 1.64× faster task completion on a 64-GPU cluster, accepted at HPDC 2026
On May 1, 2026, Dongxin Guo, Jikun Wu, and Siu Ming Yiu presented SAGA, a workflow-atomic scheduler for AI agents on GPU clusters that treats the entire agent workflow as a single schedulable unit instead of scheduling individual LLM calls. The system achieves a 1.64× geometric mean reduction in task completion time on a 64-GPU cluster and 99.2% SLO attainment under multi-tenant load. The paper was accepted at HPDC 2026 in Cleveland (July 13–16, 2026).
This article was generated using artificial intelligence from primary sources.
The paper, “SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters”, was published on ArXiv on May 1, 2026 by Dongxin Guo, Jikun Wu, and Siu Ming Yiu. It was accepted at HPDC 2026, the 35th International Symposium on High-Performance Parallel and Distributed Computing, held July 13–16, 2026, in Cleveland.
What problem does SAGA solve?
Existing GPU schedulers treat every API call to an LLM as an independent request, meaning they discard gigabytes of intermediate state (KV cache, attention contexts, scratch memory) after each call completes. This is suboptimal for AI agents, where a single workflow typically encompasses dozens of consecutive calls that share a large amount of context.
The authors frame the problem this way: “GPU schedulers treat each call as independent, discarding gigabytes of intermediate state.” The consequence is that an agent that should take a few seconds often runs for minutes because the scheduler constantly reloads state that should have remained in memory.
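To make the cost of discarded state concrete, here is a minimal toy model in Python. It is not taken from the paper: the function name `prefill_tokens`, the assumption that each call simply extends the previous call's context, and the token counts are all illustrative. It counts how many prompt tokens have to be processed across one workflow when the KV cache is dropped after every call versus kept for the whole workflow.

```python
def prefill_tokens(context_lengths, keep_state):
    """Count prompt tokens that must be processed across one workflow.

    context_lengths[i] is the total context length seen by call i.
    Toy assumption: every call extends the previous call's context,
    so a retained KV cache covers the whole earlier prefix.
    """
    total = 0
    cached = 0  # tokens whose KV entries are still resident on the GPU
    for length in context_lengths:
        total += length - cached              # only uncached tokens are processed
        cached = length if keep_state else 0  # per-call scheduling drops everything
    return total


# Hypothetical 5-call workflow whose context grows by 1,000 tokens per call.
lengths = [1000, 2000, 3000, 4000, 5000]
per_call = prefill_tokens(lengths, keep_state=False)  # 15000 tokens
workflow = prefill_tokens(lengths, keep_state=True)   # 5000 tokens
print(per_call, workflow)
```

In this toy setting the per-call scheduler processes three times as many tokens, and the gap widens as the workflow gets longer. Real workloads will differ, since contexts do not always grow as a strict prefix.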
How does the system address the problem?
SAGA treats the entire agent workflow as an atomic scheduling unit. Technically, the system uses three key mechanisms:
- Agent Execution Graphs — dependency models within the workflow that enable prediction of which KV cache pages will be needed later
- Session-affinity batching — co-locates correlated requests, balancing load across GPUs without losing state
- Fairness mechanisms — prevent a single long-running workflow from blocking other tenants
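The first two mechanisms can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SAGA's implementation: `AffinityRouter`, `still_needed`, the workflow IDs, and the block names are all hypothetical. The router pins every call of a workflow to the GPU that already holds its state, and the graph helper reports which cached context blocks are still read by steps that have not run yet.

```python
class AffinityRouter:
    """Toy session-affinity router: one workflow stays on one GPU."""

    def __init__(self, num_gpus):
        self.load = [0] * num_gpus  # active workflows per GPU
        self.home = {}              # workflow_id -> GPU index

    def route(self, workflow_id):
        # A new workflow goes to the least-loaded GPU; later calls of the
        # same workflow return to that GPU, where its KV cache lives.
        if workflow_id not in self.home:
            gpu = min(range(len(self.load)), key=self.load.__getitem__)
            self.home[workflow_id] = gpu
            self.load[gpu] += 1
        return self.home[workflow_id]

    def finish(self, workflow_id):
        gpu = self.home.pop(workflow_id)
        self.load[gpu] -= 1


def still_needed(graph, done):
    """graph maps each step to the set of context blocks it reads.

    Returns the blocks read by at least one step that has not run yet;
    anything else is a candidate for eviction.
    """
    needed = set()
    for step, blocks in graph.items():
        if step not in done:
            needed |= blocks
    return needed


router = AffinityRouter(num_gpus=2)
placements = [router.route(w) for w in ["wf-a", "wf-b", "wf-a"]]  # [0, 1, 0]

graph = {
    "plan": {"system"},
    "search": {"system", "plan_out"},
    "answer": {"system", "search_out"},
}
keep = still_needed(graph, done={"plan", "search"})  # {"system", "search_out"}
print(placements, sorted(keep))
```

Here `plan_out` is no longer read by any pending step, so a scheduler that knows the graph could evict it first. Fairness across tenants is deliberately left out of this sketch.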
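SAGA’s KV cache prediction comes within 1.31× of Bélády’s optimal offline policy. Bélády’s algorithm evicts the entry whose next use lies furthest in the future, which requires knowing the full future access sequence. No replacement policy can do better, so it serves as the theoretical benchmark for cache replacement. Staying within a factor of 1.31 of that benchmark online, without knowledge of the future, is a strong result.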
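For readers unfamiliar with Bélády’s policy, the following Python sketch compares it with plain LRU on a small textbook access trace. It only illustrates what "optimal offline" means. It does not reproduce SAGA’s predictor, and the article does not specify which metric the 1.31× figure is measured on. The function names and the trace are illustrative.

```python
from collections import OrderedDict


def belady_misses(trace, capacity):
    """Offline optimal: evict the page whose next use is furthest away."""
    cache = set()
    misses = 0
    for i, page in enumerate(trace):
        if page in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(p):
                # Position of the next access to p, or infinity if none.
                try:
                    return trace.index(p, i + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(page)
    return misses


def lru_misses(trace, capacity):
    """Online baseline: evict the least recently used page."""
    cache = OrderedDict()
    misses = 0
    for page in trace:
        if page in cache:
            cache.move_to_end(page)  # mark as most recently used
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)  # drop the oldest entry
        cache[page] = True
    return misses


trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
optimal = belady_misses(trace, capacity=3)  # 7 misses
lru = lru_misses(trace, capacity=3)         # 10 misses
print(optimal, lru, round(lru / optimal, 2))
```

On this trace LRU incurs about 1.43 times as many misses as the offline optimum. An online policy that can predict future accesses, for example from a workflow’s dependency graph, can narrow that gap.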
How large are the improvements?
Experiments on a 64-GPU cluster showed:
- 1.64× geometric mean reduction in task completion time (statistically significant, p < 0.001)
- 1.22× better GPU memory utilization — less wastage on unused KV cache pages
- 99.2% SLO attainment under multi-tenant load (SLO: Service Level Objective, here the target latency bound)
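As a reminder of how two of these headline metrics are typically computed, here is a short Python sketch. The helper names and all numbers are made up for illustration and are not the paper’s data. It assumes a "geometric mean reduction" means the geometric mean of per-workload baseline-to-new time ratios, and that SLO attainment means the fraction of requests finishing within the latency target.

```python
from statistics import geometric_mean


def geomean_speedup(baseline_times, new_times):
    """Geometric mean of per-workload speedup ratios (baseline / new)."""
    ratios = [b / n for b, n in zip(baseline_times, new_times)]
    return geometric_mean(ratios)


def slo_attainment(latencies, slo):
    """Fraction of requests whose latency is within the SLO target."""
    return sum(1 for t in latencies if t <= slo) / len(latencies)


# Hypothetical completion times in seconds for two workloads.
speedup = geomean_speedup([40.0, 10.0], [10.0, 10.0])    # ratios 4 and 1 -> 2.0
attained = slo_attainment([1.0, 2.0, 3.0, 12.0], slo=5.0)  # 3 of 4 -> 0.75
print(round(speedup, 3), attained)
```

The geometric mean is the usual choice for averaging ratios because a single workload with a very large speedup cannot dominate it: the arithmetic mean of the two ratios above would be 2.5 rather than 2.0.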
The trade-off is approximately 30% lower peak throughput compared to classic batch scheduling. This is expected: the system sacrifices raw throughput for better task completion time and memory utilization. For agent workloads where the user is waiting for the entire workflow’s response, task completion time reduction is a more useful metric than peak throughput.
What does this mean for operators of agent systems?
The commercial implication could be substantial: if AWS, Azure, or Google Cloud adopted workflow-atomic scheduling in their GPU pools, agent workloads would finish sooner on the same hardware and the infrastructure cost of agent systems could fall. Because SAGA gives up roughly 30% of peak throughput, the savings would not necessarily match the 1.64× speedup. For enterprises already spending tens of thousands of dollars monthly on agent inference, even a partial saving could influence build-vs-buy decisions.
The paper is available on ArXiv under ID 2605.00528.
Frequently Asked Questions
- What does 'workflow-atomic' mean in the context of SAGA?
- Instead of the scheduler treating each LLM call independently (and discarding gigabytes of intermediate state between calls), SAGA treats the entire agent workflow as one indivisible unit. This enables KV cache prediction, session-affinity batching, and better GPU memory utilization.
- What are the key technical results?
- 1.64× geometric mean reduction in task completion time (p < 0.001), KV cache prediction within 1.31× of Bélády’s optimal offline policy, 1.22× better GPU memory utilization, and 99.2% SLO attainment. The trade-off is approximately 30% lower peak throughput compared to batch scheduling.
- Where will the paper be presented?
- At HPDC 2026 — the 35th International Symposium on High-Performance Parallel and Distributed Computing, July 13–16, 2026, in Cleveland, Ohio. The paper is available on ArXiv under ID 2605.00528.