🟡 🤖 Models · Thursday, April 30, 2026 · 2 min read

PyTorch AutoSP: compiler-based tool automatically converts training code into sequence-parallel form for 100k+ token contexts

Editorial illustration: tokens distributed across multiple GPU cores with a compiler symbol

On April 29, 2026, PyTorch released AutoSP — a compiler-based tool within the DeepSpeed/DeepCompile ecosystem that automatically converts standard single-GPU transformer training code into a sequence-parallel variant, eliminating the manual token partitioning and communication collectives otherwise needed to train LLMs at 100k+ token contexts. It was developed with UIUC SSAIL Lab, Anyscale, and Snowflake.

On April 29, 2026, the PyTorch team, together with researchers from UIUC SSAIL Lab, Anyscale, and Snowflake, released AutoSP — a compiler-based tool within the DeepSpeed/DeepCompile ecosystem that automatically converts standard single-GPU transformer training code into a sequence-parallel variant. The goal: enable training of LLMs with extremely long contexts (100k+ tokens) without manually implementing distributed code.

The problem AutoSP solves

Training LLMs with long contexts requires splitting sequences across multiple GPUs (sequence parallelism, SP), because attention activations scale quadratically with context length and quickly exhaust GPU memory. Existing solutions (RingFlashAttention, DeepSpeed-Ulysses) require manually rewriting the training code: token partitioning, communication collectives, complex attention masking.
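A back-of-envelope illustration of the memory pressure (our arithmetic, not from the AutoSP release): the attention score matrix has shape (batch, heads, seq, seq), so without fused kernels its memory grows quadratically with context length and dominates at 100k+ tokens.

```python
# Hypothetical 32-head model, one layer, batch 1, bf16 activations.

def attn_score_bytes(seq_len, n_heads, batch=1, bytes_per_el=2):
    """Bytes to materialize one layer's attention scores (batch, heads, S, S)."""
    return batch * n_heads * seq_len * seq_len * bytes_per_el

print(attn_score_bytes(8_192, 32) / 2**30)    # 4.0  -> 4 GiB at an 8k context
print(attn_score_bytes(131_072, 32) / 2**40)  # 1.0  -> 1 TiB at a 128k context
```

Fused kernels like FlashAttention avoid materializing this matrix, but long-context activation memory as a whole still forces splitting the sequence across devices in practice.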

AutoSP automates all of that: the user writes standard transformer code, and the compiler converts it into an SP-aware variant.
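To make concrete what such a conversion has to insert, here is a conceptual sketch (our simplification, not AutoSP's internals) of the Ulysses-style data movement around attention: each rank holds a contiguous sequence shard, and an all-to-all regroups the data so that each rank holds the full sequence for a subset of attention heads.

```python
def shard_sequence(tokens, world_size):
    """Split tokens into contiguous per-rank sequence shards."""
    n = len(tokens) // world_size
    return [tokens[r * n:(r + 1) * n] for r in range(world_size)]

def all_to_all(sendbufs):
    """sendbufs[src][dst] -> recvbufs[dst][src]: transpose the rank grid."""
    world = len(sendbufs)
    return [[sendbufs[src][dst] for src in range(world)] for dst in range(world)]

# Two ranks, two head groups "A"/"B"; each rank splits its sequence shard by
# head group and addresses group g's slice to rank g:
tokens = list(range(8))
shards = shard_sequence(tokens, 2)             # [[0..3], [4..7]]
send = [[("A", s), ("B", s)] for s in shards]  # per-rank send buffers
recv = all_to_all(send)

# After the exchange, rank 0 holds head group "A" for the FULL sequence:
full_seq_rank0 = [tok for (_, chunk) in recv[0] for tok in chunk]
print(full_seq_rank0)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Writing these collectives (and their inverses for the backward pass) by hand is exactly the boilerplate the compiler pass generates automatically.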

How to enable it

Enabling AutoSP takes three lines in the DeepSpeed config:

config = {
    "compile": {
        "deepcompile": True,
        "passes": ["autosp"]
    },
    "sequence_parallel_size": 4,
    "zero_optimization": {"stage": 1},  # AutoSP composes with ZeRO 0/1
    ...
}

Inputs are prepared with the prepare_auto_sp_inputs() utility. Under the hood, AutoSP uses the DeepSpeed-Ulysses strategy: communication overhead stays roughly constant as the number of GPUs grows on NVLink/fat-tree networks, and parallelism scales up to the number of attention heads (e.g., 32 heads in 7–8B models).
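The head-count limit above follows from head-parallel attention needing the head count to be divisible by the SP degree. A small helper sketching that constraint (ours, not an AutoSP API):

```python
def max_sp_degree(n_heads, n_gpus):
    """Largest SP group size that divides n_heads, capped by available GPUs."""
    for sp in range(min(n_heads, n_gpus), 0, -1):
        if n_heads % sp == 0:
            return sp

print(max_sp_degree(32, 8))    # 8: every GPU joins one SP group
print(max_sp_degree(32, 64))   # 32: capped by the 32 attention heads
print(max_sp_degree(32, 12))   # 8: largest divisor of 32 that fits in 12 GPUs
```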

Sequence-Aware Activation Checkpointing

AutoSP also introduces SAC — a custom checkpointing strategy optimized for long-context training. Unlike the conservative PyTorch 2.0 max-flow min-cut formulation, SAC exploits the specific FLOP dynamics of long contexts: it frees intermediate activations produced by cheap-to-compute operators and recomputes them during the backward pass. The trade-off: throughput drops marginally, but even longer contexts become trainable.
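The policy can be sketched as a simple cost-based partition (our illustration of the idea, not the shipped compiler pass): save activations that are expensive to recompute, free the cheap ones and redo them in backward.

```python
# Hypothetical relative recompute costs (FLOPs per output element):
OP_COST = {"matmul": 8192, "attention": 4096, "softmax": 5, "gelu": 8, "add": 1}

def plan_checkpoints(ops, cost_threshold=100):
    """Partition ops into (saved, recomputed) by recompute cost."""
    saved, recomputed = [], []
    for op in ops:
        (saved if OP_COST[op] >= cost_threshold else recomputed).append(op)
    return saved, recomputed

saved, recomputed = plan_checkpoints(["matmul", "gelu", "add", "softmax"])
print(saved)       # ['matmul']
print(recomputed)  # ['gelu', 'add', 'softmax']
```

At long contexts the cheap elementwise operators account for a large share of activation memory but a tiny share of FLOPs, which is why recomputing them costs little throughput.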

Results

Tested on 8× A100-80GB SXM nodes (PyTorch 2.7, CUDA 12.8) with Llama 3.1 models of various sizes:

  • Maximum trainable sequence length increases significantly with the same resources
  • Runtime overhead is minimal compared to hand-written RingFlashAttention and DeepSpeed-Ulysses baselines

End-to-end examples (including Llama 3.1 8B) are available at github.com/deepspeedai/DeepSpeedExamples/tree/master/benchmarks/autosp.

Limitations

Currently, AutoSP requires a single compilable artifact (the entire transformer must be compiled as one block) and does not allow graph breaks within the model. The team notes extending support for graph-break resilience as the next step.

Frequently Asked Questions

What does AutoSP do?
It automatically converts standard single-GPU transformer training code into sequence-parallel (SP) code, enabling LLM training with 100k+ token contexts across multiple GPUs. It eliminates manual token partitioning and communication collectives, and is integrated with DeepSpeed/DeepCompile.
How do you enable it?
In your DeepSpeed config, set `'deepcompile': True`, add `'passes': ['autosp']`, use the `prepare_auto_sp_inputs()` utility, and set `'sequence_parallel_size'`. It composes with ZeRO stage 0/1.
How does performance compare to hand-written SP?
On 8× A100-80GB with Llama 3.1 models, AutoSP achieves throughput comparable to hand-written RingFlashAttention and DeepSpeed-Ulysses implementations, with minimal runtime overhead. Maximum trainable sequence length increases significantly for the same resources.
🤖

This article was generated using artificial intelligence from primary sources.