PyTorch AutoSP: compiler-based tool automatically converts training code into sequence-parallel form for 100k+ token contexts
On April 29, 2026, the PyTorch team, together with researchers from UIUC SSAIL Lab, Anyscale, and Snowflake, released AutoSP — a compiler-based tool within the DeepSpeed/DeepCompile ecosystem that automatically converts standard single-GPU transformer training code into a sequence-parallel variant. The goal: enable training of LLMs with extremely long contexts (100k+ tokens) without manually implementing distributed code.
The problem AutoSP solves
Training LLMs with long contexts requires splitting sequences across multiple GPUs (sequence parallelism, SP), because attention activations grow quadratically with context length and quickly exhaust GPU memory. Existing solutions such as RingFlashAttention and DeepSpeed-Ulysses require manually rewriting the training code: token partitioning, communication collectives, and complex attention masking.
AutoSP automates all of that: the user writes standard transformer code, and the compiler converts it into an SP-aware variant.
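For context, the code AutoSP consumes is ordinary single-device training with no distributed logic. A toy sketch (ours, not from the release; any standard PyTorch transformer works the same way):

```python
import torch
import torch.nn as nn

# A stand-in model: any transformer written for a single GPU,
# with no distributed code, is what AutoSP takes as input.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2,
).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(1, 4096, 256, device="cuda")  # [batch, seq_len, hidden]
loss = model(x).pow(2).mean()                 # dummy objective
loss.backward()
opt.step()
```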
How to enable it
Three lines of configuration in the DeepSpeed config:
```python
config = {
    "compile": {
        "deepcompile": True,
        "passes": ["autosp"],
    },
    "sequence_parallel_size": 4,
    "zero_optimization": {"stage": 1},  # AutoSP composes with ZeRO 0/1
    ...
}
```
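Wiring the config into a run follows the usual DeepSpeed pattern. In the sketch below, `deepspeed.initialize` is the standard DeepSpeed entry point; the import path and exact signature of `prepare_auto_sp_inputs` (described next) are assumptions on our part, and `model`/`loader` come from your own training script:

```python
import deepspeed
# Assumed: the release ships prepare_auto_sp_inputs; its import path
# is not shown in the announcement.

engine, _, _, _ = deepspeed.initialize(
    model=model,                          # your standard transformer
    model_parameters=model.parameters(),
    config=config,                        # the dict above (optimizer etc. elided)
)

for batch in loader:
    # Assumed usage: shard the batch along the sequence dimension for this rank.
    local_batch = prepare_auto_sp_inputs(batch)
    loss = engine(**local_batch)
    engine.backward(loss)
    engine.step()
```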
The inputs themselves are prepared with the prepare_auto_sp_inputs() utility, as in the sketch above. Under the hood, the strategy uses the DeepSpeed-Ulysses architecture: per-GPU communication overhead stays constant as the number of GPUs grows on NVLink/fat-tree networks, and the sequence-parallel degree scales up to the number of attention heads (e.g., 32 heads in 7–8B models).
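To make the Ulysses pattern concrete, here is a sketch of the core collective (an illustration of the published DeepSpeed-Ulysses design, not AutoSP's actual generated code): each GPU starts with a shard of the sequence for all attention heads and, after a single all-to-all, holds the full sequence for a subset of heads, so attention itself runs unmodified.

```python
import torch
import torch.distributed as dist

def seq_to_head_all_to_all(x: torch.Tensor, group=None) -> torch.Tensor:
    """Ulysses-style all-to-all: trade a sequence shard for a head shard.

    x: [seq_len / P, num_heads, head_dim] on each of P ranks
    returns: [seq_len, num_heads / P, head_dim]
    Requires an initialized process group; the inverse exchange runs after attention.
    """
    P = dist.get_world_size(group)
    s_local, h, d = x.shape
    assert h % P == 0, "head count must be divisible by the SP degree"
    # [s_local, P, h/P, d] -> [P, s_local, h/P, d]; slice p is sent to rank p
    x = x.reshape(s_local, P, h // P, d).transpose(0, 1).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)
    # out[j] is rank j's sequence shard for this rank's head group;
    # concatenating in rank order restores the full sequence
    return out.reshape(P * s_local, h // P, d)
```

This also shows why the approach scales only up to the number of attention heads: after the exchange, each rank must own at least one head.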
Sequence-Aware Activation Checkpointing
AutoSP also introduces SAC, a custom checkpointing strategy optimized for long-context training. Unlike the conservative max-flow/min-cut formulation in PyTorch 2.0, SAC exploits the specific FLOP profile of long contexts: it frees the intermediate activations of cheap-to-recompute operators and recomputes them during the backward pass. The trade-off: throughput drops marginally, but even longer contexts become feasible.
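The mechanism can be sketched with stock PyTorch (an illustration of the idea, not SAC's implementation): wrap operators that are cheap to recompute relative to attention at long context, such as an MLP's normalization and activation, in torch.utils.checkpoint so their intermediates are freed in the forward pass and rebuilt in the backward pass.

```python
import torch
from torch.utils.checkpoint import checkpoint

class MLPBlock(torch.nn.Module):
    """Toy residual MLP block whose intermediates are recomputed, not stored."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = torch.nn.LayerNorm(dim)
        self.fc1 = torch.nn.Linear(dim, hidden)
        self.fc2 = torch.nn.Linear(hidden, dim)

    def _mlp(self, x):
        return self.fc2(torch.nn.functional.gelu(self.fc1(self.norm(x))))

    def forward(self, x):
        # Free the norm/GELU/fc1 intermediates now; recompute them in backward.
        return x + checkpoint(self._mlp, x, use_reentrant=False)

block = MLPBlock(256, 1024)
y = block(torch.randn(1, 4096, 256, requires_grad=True))
y.mean().backward()  # _mlp runs a second time here instead of caching
```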
Results
Tested on 8× A100-80GB SXM nodes (PyTorch 2.7, CUDA 12.8) with Llama 3.1 models of various sizes:
- Maximum trainable sequence length increases significantly with the same resources
- Runtime overhead is minimal compared to hand-written RingFlashAttention and DeepSpeed-Ulysses baselines
End-to-end examples (including Llama 3.1 8B) are available at github.com/deepspeedai/DeepSpeedExamples/tree/master/benchmarks/autosp.
Limitations
Currently, AutoSP requires a single compilable artifact (the entire transformer must be compiled as one block) and does not allow graph breaks inside the model. The team names support for graph-break resilience as the next step.
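The constraint is analogous to torch.compile's fullgraph=True mode (the analogy is ours; AutoSP's exact failure behavior is not documented in the announcement). A toy example of the kind of code that violates the single-graph requirement:

```python
import torch

@torch.compile(fullgraph=True)  # demand a single graph, no breaks
def forward(x):
    # Data-dependent Python control flow cannot be traced into one graph;
    # with fullgraph=True this raises instead of silently breaking the graph.
    if x.sum() > 0:
        return x * 2
    return x

forward(torch.randn(8))  # raises a torch._dynamo "Unsupported" error
```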
Frequently Asked Questions
- What does AutoSP do?
- It automatically converts standard single-GPU transformer training code into sequence-parallel (SP) code, enabling LLM training with 100k+ token contexts across multiple GPUs. It eliminates manual token partitioning and communication collectives and is integrated with DeepSpeed/DeepCompile.
- How do you enable it?
- In your DeepSpeed config, set `"deepcompile": True`, add `"passes": ["autosp"]`, set `"sequence_parallel_size"`, and prepare inputs with the `prepare_auto_sp_inputs()` utility. It composes with ZeRO stage 0/1.
- How does performance compare to hand-written SP?
- On 8× A100-80GB with Llama 3.1 models, AutoSP achieves throughput comparable to hand-written RingFlashAttention and DeepSpeed-Ulysses implementations, with minimal runtime overhead. Maximum trainable sequence length increases significantly for the same resources.
This article was generated using artificial intelligence from primary sources.