PyTorch 2.12: Device-Agnostic Graph API + MX Quantization

PyTorch 2.12 is the new production release of the PyTorch framework published on May 13, 2026, with 2,926 commits and 457 contributors. Key features: torch.accelerator.Graph device-agnostic API for CUDA, XPU and out-of-tree backends, torch.export support for Microscaling MX quantization (MXFP4/6/8), linalg.eigh up to 100× faster on CUDA via cuSolver, and torch.cond inside CUDA Graphs. TorchScript has been formally removed.

The PyTorch Foundation released version 2.12 of the framework on May 13, 2026. The release moves toward a multi-vendor accelerator API, adds export support for aggressive quantization formats, and significantly speeds up linear algebra operations, alongside the formal removal of TorchScript.

How does torch.accelerator.Graph change graph capture?

torch.accelerator.Graph is the new unified API for graph capture and replay that works across CUDA, XPU and out-of-tree backends. It replaces device-specific implementations such as torch.xpu.XPUGraph. Backends register via a lightweight GraphImplInterface, and c10::Stream and torch.Stream gain a new is_capturing() method for backend-agnostic stream checking. The implementation was contributed by Guangye Yu (Intel) through PRs #171269 and #171285.
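The capture-and-replay idea, with per-backend registration, can be illustrated with a toy model in plain Python. This is only a sketch of the concept: every name below (`ToyGraph`, `ToyGraphImpl`, `register_graph_impl`, `submit`) is hypothetical and is not the torch.accelerator.Graph or GraphImplInterface API, whose actual signatures are not given here.

```python
# Toy model only: none of these names are PyTorch APIs.
_BACKENDS = {}


def register_graph_impl(device_type, impl_cls):
    """Hypothetical registry: one graph implementation per device type."""
    _BACKENDS[device_type] = impl_cls


class ToyGraphImpl:
    """Hypothetical backend hook that records work during capture."""

    def __init__(self):
        self._ops = []
        self._capturing = False

    def capture_begin(self):
        self._ops.clear()
        self._capturing = True

    def capture_end(self):
        self._capturing = False

    def is_capturing(self):
        return self._capturing

    def submit(self, op):
        if self._capturing:
            self._ops.append(op)  # record only; nothing runs during capture
        else:
            op()

    def replay(self):
        for op in self._ops:
            op()


class ToyGraph:
    """Device-agnostic front end that delegates to the registered backend."""

    def __init__(self, device_type):
        self._impl = _BACKENDS[device_type]()

    def __enter__(self):
        self._impl.capture_begin()
        return self

    def __exit__(self, *exc_info):
        self._impl.capture_end()
        return False

    def submit(self, op):
        self._impl.submit(op)

    def replay(self):
        self._impl.replay()


register_graph_impl("toy", ToyGraphImpl)

buffer = {"x": 1}


def double():
    buffer["x"] *= 2


graph = ToyGraph("toy")
with graph:
    graph.submit(double)
    value_during_capture = buffer["x"]  # unchanged: the op was only recorded

graph.replay()
graph.replay()
```

The front end never refers to a specific device. It looks up whatever implementation was registered for the device type, which is the property that lets one user-facing class serve several backends.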

What does MX quantization in torch.export enable?

torch.export.save and torch.export.load now support the float8_e8m0fnu dtype, the 8-bit exponent-only type that MX formats use for their shared per-block scales. This allows models quantized to MXFP4, MXFP6 and MXFP8 to be exported in full, which matters for deploying LLMs in cost-constrained and edge environments. The contribution came from Chizkiyahu Raful (ARM) through PR #176270.
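To make the role of the E8M0 scale concrete, here is a minimal pure-Python sketch of MXFP4-style block quantization: one power-of-two scale stored as a biased exponent byte, plus FP4 (E2M1) elements. It is an illustration under simplifying assumptions, not PyTorch code: the function names are hypothetical, rounding is plain nearest-value rather than the rounding mode of any particular implementation, and the block here is shorter than the 32 elements used by the MX formats.

```python
import math

# Magnitudes an FP4 (E2M1) element can represent; the sign is a separate bit.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_EMAX = 2      # exponent of the largest FP4 binade (values 4.0 and 6.0)
E8M0_BIAS = 127   # E8M0 stores only a biased exponent; 255 is reserved for NaN


def quantize_block(block):
    """Return (scale_byte, fp4_values) for one block of floats."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        exponent = 0
    else:
        # math.frexp gives max_abs = m * 2**e with 0.5 <= m < 1,
        # so floor(log2(max_abs)) equals e - 1.
        exponent = math.frexp(max_abs)[1] - 1 - FP4_EMAX
    scale_byte = min(254, max(0, exponent + E8M0_BIAS))
    scale = 2.0 ** (scale_byte - E8M0_BIAS)
    elements = []
    for x in block:
        target = abs(x) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(target - m))
        elements.append(math.copysign(nearest, x))
    return scale_byte, elements


def dequantize_block(scale_byte, elements):
    """Rebuild approximate floats from the shared scale and FP4 values."""
    scale = 2.0 ** (scale_byte - E8M0_BIAS)
    return [scale * v for v in elements]


scale_byte, elements = quantize_block([0.25, -0.75, 1.5, 0.1])
restored = dequantize_block(scale_byte, elements)
```

In this example the three values that fit the FP4 grid after scaling are recovered exactly, while 0.1 comes back as 0.125. The scale byte is what a dtype like float8_e8m0fnu holds, so a serializer that cannot store it cannot round-trip an MX-quantized tensor.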

What speedups does 2.12 deliver?

linalg.eigh has been migrated from the legacy MAGMA backend to cuSolver, using syevj_batched unconditionally for batched operations. The PyTorch team reports up to 100× speedup on CUDA in typical ML workloads, so operations that used to take minutes now complete in seconds. In addition, torch.cond data-dependent control flow can now be captured inside CUDA Graphs via CUDA 12.4 conditional IF nodes, eliminating the previous fallback to CUDA graph trees. Adagrad also gains fused=True support, joining Adam, AdamW and SGD.
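The syevj family of cuSolver routines is based on the Jacobi eigenvalue method, which repeatedly applies plane rotations that zero out off-diagonal entries of a symmetric matrix. The pure-Python sketch below shows that idea on a small matrix. It is an illustration of the method only, not the cuSolver implementation or the code path linalg.eigh uses; the function name and the tolerance defaults are assumptions.

```python
import math


def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-20):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [[float(v) for v in row] for row in matrix]  # work on a copy
    for _ in range(sweeps):
        # Stop once the off-diagonal mass is negligible.
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes a[p][q] in J^T A J.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                # Right-multiply by J: update columns p and q.
                for k in range(n):
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                # Left-multiply by J^T: update rows p and q.
                for k in range(n):
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


eigs_2x2 = jacobi_eigenvalues([[2, 1], [1, 2]])
eigs_3x3 = jacobi_eigenvalues([[2, 0, 0], [0, 3, 4], [0, 4, 9]])
```

Each rotation touches only two rows and two columns, so rotations on disjoint index pairs are independent of one another. That structure is one reason Jacobi-type methods suit batched GPU execution on many small matrices.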

What does removing TorchScript mean?

TorchScript has been deprecated since 2.10 and is formally removed in 2.12. The recommended replacements are torch.export for model serialization and ExecuTorch for the embedded runtime. The CUDA 12.8 wheel is no longer published in the standard release matrix: PyTorch recommends CUDA 12.6 for older architectures (Pascal, Volta) and CUDA 13.0+ for Blackwell.

A live Q&A event with panelists Joe Spisak, Andrey Talman and Alban Desmaison is scheduled for Wednesday, May 20, 2026 at 10 AM PST.

Frequently Asked Questions

What is torch.accelerator.Graph?

A unified API for graph capture and replay across CUDA, XPU and out-of-tree backends, replacing device-specific implementations such as torch.xpu.XPUGraph; backends register via a lightweight GraphImplInterface, and c10::Stream gains an is_capturing() method for backend-agnostic stream checking.

What does removing TorchScript mean?

TorchScript has been deprecated since version 2.10 and is formally removed in 2.12. The recommended replacements are torch.export for model serialization and ExecuTorch for the embedded runtime. Existing production code that relies on TorchScript must migrate before upgrading to 2.12 or later.
