PyTorch and Meta: over 90 percent effective training time through 40+ optimizations, MegaCache cuts PT2 compile time by 40 percent
Why it matters
Meta has published how it achieved over 90 percent Effective Training Time (ETT) for offline training of its recommendation models. The work spans more than 40 new optimizations across the PyTorch ecosystem: MegaCache, which cuts PT2 compilation time by 40 percent; standalone model publishing, which saves 30 minutes per job; and async checkpointing. The improvements are open-sourced through PyTorch and TorchRec.
A team of 16 engineers at Meta, along with contributors from the PyTorch project, published a detailed post on April 17, 2026 describing how they achieved over 90 percent Effective Training Time (ETT) for offline training of recommendation models by the end of 2025. The article is a joint work by Ruilin Chen, Yuzhen Huang, Hang Qi and others, and lists 40+ concrete optimizations developed in the process.
What ETT is and why it matters
Meta introduces a new metric — Effective Training Time (ETT%) — that measures what percentage of total end-to-end wall time actually goes toward productive training.
The formula is simple:
ETT% = 100% - Idleness% - Failure%
Meta breaks ETT into three sub-metrics:
- Time to Start — from hardware allocation to first batch consumption
- Time to Recover — how long restart and resume takes after failures
- Number of Failures — total number of infra-related interruptions
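The formula above can be sketched as a small helper; the numbers in the example are illustrative, not from the post:

```python
# Sketch of the ETT% formula from the post: effective training time as a
# share of total wall time, after subtracting idle and failure overheads.

def ett_percent(total_hours, idle_hours, failure_hours):
    idleness = 100.0 * idle_hours / total_hours   # Idleness%
    failure = 100.0 * failure_hours / total_hours  # Failure%
    return 100.0 - idleness - failure

# Example (made-up numbers): 100 h wall time, 6 h idle (startup, blocking
# checkpoints), 3 h lost to failures and recovery -> 91% ETT.
print(ett_percent(100, 6, 3))  # 91.0
```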
The reason this matters: the classic metric Model FLOPs Utilization (MFU) only measures efficiency within training, but ignores everything that happens before, between and after it. At scale, “in-between” phases become the dominant cost.
MegaCache: 40 percent less compilation
One of the main innovations is MegaCache — a consolidated caching system for PT2 (PyTorch 2.0) components:
- Inductor cache
- Triton bundler
- AOT Autograd
- Dynamo PGO
- Autotune settings
Meta merged them into a single cache that is built once and reused across subsequent jobs.
Result: ~40 percent reduction in average PT2 compile time by the end of 2025. Additional benefits include reduced demands on remote servers, faster model setup and more reliable startup for retried jobs.
Checkpoint management
Checkpointing is critical for recovery, but usually blocks training. Meta worked on three fronts:
Async checkpointing:
- Creates a CPU memory copy of the checkpoint
- The main trainer continues while a background process uploads
- Reduces GPU idle time
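The pattern behind async checkpointing can be illustrated with a generic sketch (this is the general technique, not Meta's implementation; PyTorch also ships a native `torch.distributed.checkpoint.async_save` for the real workflow):

```python
import copy
import threading
import time

def async_checkpoint(state, persist_fn):
    snapshot = copy.deepcopy(state)       # blocking step: cheap CPU-memory copy
    t = threading.Thread(target=persist_fn, args=(snapshot,))
    t.start()                             # non-blocking: upload in background
    return t

saved = {}
def fake_upload(snapshot):
    time.sleep(0.01)                      # simulate a slow storage upload
    saved.update(snapshot)

state = {"step": 100, "weights": [0.1, 0.2]}
handle = async_checkpoint(state, fake_upload)
state["step"] = 101                       # training continues immediately
handle.join()                             # later: wait before shutdown
print(saved["step"])                      # snapshot kept the pre-update step: 100
```

The key design point is that only the CPU copy blocks the trainer; the slow upload overlaps with subsequent training steps.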
PyTorch native staging:
- Replaced the custom C++ staging implementation
- Uses new PyTorch native APIs
- Trade-off: higher trainer memory for less blocking time
Interval optimization:
- Unsaved Training Time = (# failures) × (checkpoint interval) / 2
- Checkpoint Save Blocking Time = (train loop time) / (checkpoint interval) × (blocking time per checkpoint)
- The optimal interval minimizes total lost time
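Combining the two formulas, total lost time as a function of the interval T is lost(T) = F·T/2 + (L/T)·b, where F is the expected number of failures, L the train-loop time, and b the blocking time per checkpoint. Setting the derivative to zero gives T* = sqrt(2·L·b/F). A quick check with illustrative numbers (not from the post):

```python
import math

def optimal_interval(num_failures, train_loop_time, blocking_per_ckpt):
    # T* = sqrt(2*L*b / F), from d/dT [F*T/2 + (L/T)*b] = 0
    return math.sqrt(2 * train_loop_time * blocking_per_ckpt / num_failures)

def lost_time(interval, num_failures, train_loop_time, blocking_per_ckpt):
    unsaved = num_failures * interval / 2                       # work lost to failures
    blocking = train_loop_time / interval * blocking_per_ckpt   # time spent blocked
    return unsaved + blocking

# e.g. 4 expected failures, a 24 h (86400 s) loop, 30 s blocking per save
T = optimal_interval(4, 86400, 30)  # ~1138 s, i.e. checkpoint every ~19 min
```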
Standalone model publishing: 30 minutes less
In the classic flow, training finishes and the same GPUs then perform model publishing (export to production format, validation, upload).
Meta separated publishing from training:
- Training creates an anchor checkpoint
- A separate CPU-based standalone job publishes the model in parallel
Result: ~30 minutes less per job. For companies running hundreds of training jobs per day, that adds up to hundreds of GPU-hours saved every day.
Trainer initialization
Communication optimizations:
- Eliminated unnecessary process group creations
- Removed unnecessary all_gather calls for metadata
- Instead, global rank metadata is built locally after the sharding plan broadcast
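The broadcast-instead-of-all_gather idea can be sketched as follows. This is a hypothetical single-process illustration (names like derive_rank_metadata and the plan layout are invented), not Meta's code: every rank runs the same deterministic function on the one broadcast plan, so no per-rank collective is needed for metadata:

```python
import os
import torch.distributed as dist

def derive_rank_metadata(sharding_plan, world_size):
    # Deterministic local computation: all ranks see the same broadcast plan,
    # so each can rebuild the global rank -> shard mapping without all_gather.
    tables = sharding_plan["tables"]
    return {rank: tables[rank % len(tables)] for rank in range(world_size)}

# Single-process gloo group just to make the sketch runnable
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29511")
dist.init_process_group("gloo", rank=0, world_size=1)

plan = [{"tables": ["user_emb", "item_emb"]}]  # mutated in place on receivers
dist.broadcast_object_list(plan, src=0)        # one broadcast replaces N all_gathers
metadata = derive_rank_metadata(plan[0], dist.get_world_size())
dist.destroy_process_group()
print(metadata)
```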
Pipeline optimizations:
- Parallelization of independent initialization phases
- PT2 compilation overlaps with DDP warm-up using “fast batch” data
- Particularly useful for foundation models with long data loading times
Failure reduction
Meta identified two main causes of failures:
- Job preemptions (more concurrent jobs = more conflicts)
- Service regressions
Their response is two-pronged: collaboration with infra teams on new scheduling algorithms, plus a component-level observability dashboard that displays Time to Start (TTS), Time to Recover (TTR), unsaved training time and checkpoint saving time in real time.
Open-source contributions
PyTorch 2.0 improvements:
- TORCH_COMPILE_DYNAMIC_SOURCES for dynamic shape handling
- MegaCache end-to-end caching system
- PyTorch native staging APIs
TorchRec improvements:
- Sharding plan optimizations (eliminated all_gather overhead)
- Communication optimization patterns
All of these are available in the PyTorch documentation, so other organizations can replicate them.
Message for the industry
The deepest lesson from Meta’s post is a paradigm shift in optimization: from “how do we train each iteration faster” to “how do we reduce everything that isn’t actual training”. While the community focuses on MFU and increasing throughput, Meta shows that a 10 percent ETT gain is just as valuable as a 10 percent MFU gain, and often much easier to achieve through engineering.
For organizations scaling AI training, ETT becomes just as important a metric as MFU.
This article was generated using artificial intelligence from primary sources.