PyTorch SMG: CPU-GPU disaggregation in LLM serving delivers 3.5× output throughput for Llama 3.3 70B FP8, already in production on Google Cloud, Oracle, and Alibaba
LightSeek Foundation presented Shepherd Model Gateway (SMG) on the PyTorch blog on April 30, 2026 — a Rust gateway that moves CPU-bound tasks (tokenization, MCP orchestration, chat history, multimodal preprocessing) out of the GPU process into a separate gRPC layer. Llama 3.3 70B FP8 achieves 1,150 vs 327 output tokens/s (3.5× throughput), and the solution is already in production on Google Cloud, Oracle Cloud, Alibaba Cloud, and TogetherAI.