PyTorch SMG: CPU-GPU disaggregation in LLM serving delivers 3.5× output throughput for Llama 3.3 70B FP8, already in production on Google Cloud, Oracle, and Alibaba
LightSeek Foundation presented Shepherd Model Gateway (SMG) on the PyTorch blog on April 30, 2026 — a Rust gateway that moves CPU-bound tasks (tokenization, MCP orchestration, chat history, multimodal preprocessing) out of the GPU process into a separate gRPC layer. With SMG, Llama 3.3 70B FP8 reaches 1,150 output tokens/s versus 327 without it (3.5× throughput), and the solution is already in production on Google Cloud, Oracle Cloud, Alibaba Cloud, and TogetherAI.
LightSeek Foundation published Shepherd Model Gateway (SMG) on the official PyTorch blog on April 30, 2026 — a project arguing that in modern LLM serving, the CPU has become a bottleneck for expensive GPU clusters. SMG moves all CPU-bound tasks out of the GPU process into a separate Rust gateway layer that communicates with the engine via gRPC. The authors — Simo Lin, Chang Su, and Keyang Ru — sum up the architecture's guiding principle as: “GPUs must do tensor math, everything else belongs in a separate serving layer.”
What problem does disaggregation actually solve?
Python’s GIL (Global Interpreter Lock) restricts CPU-bound tasks like tokenization and detokenization to single-threaded execution, even when a Rust or C++ tokenizer library runs underneath, because the Python code that drives it still runs under the lock. In SGLang and vLLM this becomes a bottleneck under real production traffic: every microsecond of GIL-bound tokenization is a microsecond during which GPUs worth hundreds of thousands of dollars sit idle. In large prefill-decode disaggregated serving with expert parallelism, this accumulates into a significant loss of hardware utilization.
How is SMG architecture structured?
SMG identifies every CPU-bound workload that is otherwise interleaved with the GPU process: tokenization, detokenization, reasoning output parsing, function call extraction, MCP tool orchestration, multimodal preprocessing, chat history management, structured output validation, and stop sequence detection. All of these tasks move into a Rust gateway that communicates with the engine via a minimal gRPC protocol — the engine receives pre-tokenized input and streams output tokens, while the gateway handles everything else. Tokenization in Rust uses a two-level cache: an L0 exact-match cache for repeated prompts and an L1 prefix-aware cache keyed at special token boundaries.
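For illustration, here is a minimal Rust sketch of how such a two-level tokenization cache could be structured; the type names, fields, and toy tokenizer are assumptions for this example, not SMG's actual code:

```rust
use std::collections::HashMap;

/// Sketch of a two-level tokenization cache (L0 exact-match, L1 prefix-aware).
struct TokenCache {
    /// L0: full prompt string -> token ids, for exactly repeated prompts.
    exact: HashMap<String, Vec<u32>>,
    /// L1: prompt prefixes ending on a special-token boundary -> token ids.
    /// Splitting only at special tokens keeps suffix tokenization safe,
    /// since BPE merges cannot cross a special-token boundary.
    prefix: HashMap<String, Vec<u32>>,
}

impl TokenCache {
    fn new() -> Self {
        Self { exact: HashMap::new(), prefix: HashMap::new() }
    }

    /// Encode a prompt: try L0, then the longest cached L1 prefix, then
    /// tokenize whatever remains. (A real cache would also bound its size.)
    fn encode(&mut self, prompt: &str, tokenize: impl Fn(&str) -> Vec<u32>) -> Vec<u32> {
        if let Some(ids) = self.exact.get(prompt) {
            return ids.clone(); // L0 hit: no tokenization at all
        }
        let mut ids = Vec::new();
        let mut rest = prompt;
        // L1: reuse tokens of the longest cached prefix, tokenize only the suffix.
        if let Some((p, cached)) = self
            .prefix
            .iter()
            .filter(|(p, _)| prompt.starts_with(p.as_str()))
            .max_by_key(|(p, _)| p.len())
        {
            ids.extend_from_slice(cached);
            rest = &prompt[p.len()..];
        }
        ids.extend(tokenize(rest));
        self.exact.insert(prompt.to_string(), ids.clone());
        ids
    }
}

fn main() {
    let mut cache = TokenCache::new();
    // Toy "tokenizer" for the example: one id per whitespace-separated word.
    let tokenize = |s: &str| s.split_whitespace().map(|w| w.len() as u32).collect::<Vec<u32>>();
    // Seed an L1 entry at what would be a chat-template special-token boundary.
    let system = "<|system|>You are helpful.<|end|>";
    cache.prefix.insert(system.to_string(), tokenize(system));
    let ids = cache.encode(&format!("{system}Hello there"), &tokenize);
    println!("{} tokens", ids.len());
}
```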
What does SMG offer development teams?
A single SMG process fronts an entire fleet — multiple models, multiple engines, one entry point. It can route requests across SGLang, vLLM, TensorRT-LLM, and MLX backends simultaneously, and supports OpenAI, Anthropic, Google Gemini, AWS Bedrock, and Azure OpenAI as external providers. Native APIs include Chat Completions, Responses API, Anthropic Messages API (with ThinkingConfig and interleaved reasoning blocks), Gemini Interactions API, and Realtime API over WebSockets/WebRTC. The authors particularly highlight the multimodal component — they rewrote parts of HuggingFace transformers image processing from Python to Rust, which they describe as an industry first.
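As a conceptual sketch of that fleet-level routing (the types and names below are invented for this example, not SMG's actual configuration), a single gateway process can map model names to heterogeneous backends:

```rust
use std::collections::HashMap;

/// Illustrative backend kinds a single gateway entry point might front.
#[derive(Debug, Clone)]
enum Backend {
    Sglang { grpc_addr: String },
    Vllm { grpc_addr: String },
    TensorRtLlm { grpc_addr: String },
    Mlx { grpc_addr: String },
    /// External providers (OpenAI, Anthropic, Gemini, Bedrock, Azure OpenAI)
    /// would be proxied rather than served by a local engine.
    External { provider: String },
}

struct Router {
    /// model name in the incoming request -> backend that serves it
    routes: HashMap<String, Backend>,
}

impl Router {
    fn resolve(&self, model: &str) -> Option<&Backend> {
        self.routes.get(model)
    }
}

fn main() {
    let mut routes = HashMap::new();
    routes.insert(
        "llama-3.3-70b-fp8".to_string(),
        Backend::Sglang { grpc_addr: "engine-0:50051".to_string() },
    );
    routes.insert(
        "gpt-4o".to_string(),
        Backend::External { provider: "openai".to_string() },
    );
    let router = Router { routes };
    // One entry point: the same gateway resolves local engines and external providers.
    println!("{:?}", router.resolve("llama-3.3-70b-fp8"));
}
```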
Why is this important for the open-source LLM ecosystem?
SMG argues that the inference engine and gateway should evolve independently: the engine can be improved with new GPU kernels and quantization without touching the gateway, while the gateway gains new parsers, tools, and protocols without touching the engine. The boundary interface between them (smg-grpc-proto on PyPI) becomes a stable contract. Production deployments include Google Cloud, Oracle Cloud, Alibaba Cloud, and TogetherAI — suggesting that disaggregation is no longer an academic concept but an operational pattern in the industry.
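Based only on the description above (the engine receives pre-tokenized input and streams output tokens), the boundary contract might look roughly like the following Rust types; the message and field names are assumptions, not the actual smg-grpc-proto schema:

```rust
/// Hypothetical request the gateway sends to the engine over gRPC:
/// tokenization has already happened on the gateway side.
#[derive(Debug)]
struct GenerateRequest {
    request_id: String,
    input_token_ids: Vec<u32>,
    max_new_tokens: u32,
    temperature: f32,
}

/// Hypothetical streamed chunk the engine sends back: raw token ids only.
/// Detokenization, stop-sequence detection, reasoning/function-call parsing,
/// and structured-output validation all stay on the gateway side.
#[derive(Debug)]
struct GenerateChunk {
    request_id: String,
    token_ids: Vec<u32>,
    finished: bool,
}

fn main() {
    // Gateway-side view: tokens in, tokens out; the engine never sees strings.
    let req = GenerateRequest {
        request_id: "req-1".to_string(),
        input_token_ids: vec![1, 15043, 3186],
        max_new_tokens: 128,
        temperature: 0.7,
    };
    println!("{:?}", req);
    let chunk = GenerateChunk { request_id: req.request_id.clone(), token_ids: vec![29871], finished: false };
    println!("{:?}", chunk);
}
```

A contract this narrow is what lets new GPU kernels or quantization land on the engine side, and new parsers, tools, and protocols land on the gateway side, without either breaking the other.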
Frequently Asked Questions
- What is the main problem SMG solves?
- Python's GIL (Global Interpreter Lock) restricts CPU-bound tasks like tokenization and tool orchestration to single-threaded execution, slowing down expensive GPU clusters. SMG moves all of those tasks into a Rust gateway outside the Python process.
- What is the actual performance gain?
- On Llama 3.3 70B FP8, output throughput jumps from 327 to 1,150 tokens/s (3.5× higher). In long-context scenarios, the gain averages +12.2% throughput across different configurations.
- Who is already using SMG in production?
- Google Cloud, Oracle Cloud, Alibaba Cloud, and TogetherAI are listed as production deployments. The project reached thirteen releases in six months.
This article was generated using artificial intelligence from primary sources.