🤖 24 AI
🟡 🏥 In Practice Saturday, April 18, 2026 · 4 min read

PyTorch and Meta: over 90 percent effective training time through 40+ optimizations, MegaCache cuts PT2 compile time by 40 percent

Why it matters

Meta has published how it achieved over 90 percent Effective Training Time (ETT) for offline training of its recommendation models. The method includes more than 40 new optimizations in the PyTorch ecosystem, MegaCache, which cuts PT2 compilation time by 40 percent, standalone model publishing that saves about 30 minutes per job, and async checkpointing. The improvements are open-sourced through PyTorch and TorchRec.

A team of 16 engineers at Meta, along with contributors from the PyTorch project, published a detailed post on April 17, 2026 describing how they achieved over 90 percent Effective Training Time (ETT) for offline training of recommendation models by the end of 2025. The article is joint work by Ruilin Chen, Yuzhen Huang, Hang Qi and others, and contains a list of 40+ concrete optimizations developed in the process.

What ETT is and why it matters

Meta introduces a new metric — Effective Training Time (ETT%) — that measures what percentage of total end-to-end wall time actually goes toward productive training.

The formula is simple:

ETT% = 100% - Idleness% - Failure%

Meta breaks ETT into three sub-metrics:

  1. Time to Start — from hardware allocation to first batch consumption
  2. Time to Recover — how long restart and resume takes after failures
  3. Number of Failures — total number of infra-related interruptions

The reason this matters: the classic metric Model FLOPs Utilization (MFU) only measures efficiency within training, but ignores everything that happens before, between and after it. At scale, “in-between” phases become the dominant cost.
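To make the definition concrete, the metric can be computed directly from measured phase durations. A minimal sketch (the function name and inputs are illustrative, not Meta's internal tooling):

```python
def ett_percent(total_hours: float, idle_hours: float, failed_hours: float) -> float:
    """Effective Training Time as a share of end-to-end wall time.

    Implements ETT% = 100% - Idleness% - Failure%, where the two
    subtracted terms are idle and failure-lost time as percentages
    of total wall time.
    """
    idleness = 100.0 * idle_hours / total_hours
    failure = 100.0 * failed_hours / total_hours
    return 100.0 - idleness - failure


# A 100-hour job that spent 6 hours idle (startup, checkpoint stalls)
# and lost 3 hours of work to failures:
print(ett_percent(100.0, 6.0, 3.0))  # 91.0
```

Note that a job can have high MFU during its compute phases and still score a poor ETT% if startup, recovery and checkpointing eat into wall time.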

MegaCache: 40 percent less compilation

One of the main innovations is MegaCache — a consolidated caching system for the PT2 (PyTorch 2 compiler stack) components:

  • Inductor cache
  • Triton bundler
  • AOT Autograd
  • Dynamo PGO
  • Autotune settings

Meta merged them into a single cache that is built once and reused across subsequent jobs.

Result: ~40 percent reduction in average PT2 compile time by the end of 2025. Additional benefits include reduced demands on remote servers, faster model setup and more reliable startup for retried jobs.
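The core idea is that a first job warms all the component caches and saves them as one portable artifact that later jobs load wholesale. A pure-Python sketch of that bundling pattern (the functions and the bundle format here are hypothetical stand-ins; recent PyTorch releases expose a similar mechanism via `torch.compiler.save_cache_artifacts()` / `load_cache_artifacts()`):

```python
import pickle

def save_cache_artifacts(components: dict) -> bytes:
    """Serialize all warm component caches into a single portable blob."""
    return pickle.dumps(components)

def load_cache_artifacts(blob: bytes) -> dict:
    """Restore every component cache at once in a retried or follow-up job."""
    return pickle.loads(blob)

# Job 1 warms each component cache separately, then bundles them once.
warm_caches = {
    "inductor": b"compiled kernels",
    "triton_bundler": b"bundled triton binaries",
    "aot_autograd": b"joint forward/backward graphs",
    "dynamo_pgo": b"profile-guided dynamic shapes",
    "autotune": b"best kernel configs",
}
blob = save_cache_artifacts(warm_caches)

# Jobs 2..N skip recompilation by restoring the whole bundle.
restored = load_cache_artifacts(blob)
assert restored == warm_caches
```

Consolidation is what makes retries reliable: a retried job either gets all component caches consistently or rebuilds from scratch, instead of hitting a partially warm mix.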

Checkpoint management

Checkpointing is critical for recovery, but usually blocks training. Meta worked on three fronts:

Async checkpointing:

  • Creates a CPU memory copy of the checkpoint
  • The main trainer continues while a background process uploads
  • Reduces GPU idle time
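The async pattern above can be sketched in a few lines of stdlib Python (the structure is the point; a real trainer would stage to pinned CPU memory and upload to remote storage):

```python
import copy
import threading
import time

saved = []  # stands in for remote checkpoint storage

def persist(snapshot, step):
    """Background upload: slow, but off the training critical path."""
    time.sleep(0.05)  # stands in for a slow upload to remote storage
    saved.append((step, snapshot))

state = {"weights": [0.0]}
threads = []
for step in range(3):
    state["weights"][0] += 1.0           # "training" mutates state in place
    snapshot = copy.deepcopy(state)      # blocking part: cheap CPU copy ("staging")
    t = threading.Thread(target=persist, args=(snapshot, step))
    t.start()                            # non-blocking: upload runs in background
    threads.append(t)
for t in threads:
    t.join()
```

The deep copy is essential: without it, the background writer would observe later training steps mutating the same state it is uploading.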

PyTorch native staging:

  • Replaced the custom C++ staging implementation
  • Uses new PyTorch native APIs
  • Trade-off: higher trainer memory for less blocking time

Interval optimization:

  • Unsaved Training Time = (# failures) × (checkpoint interval) / 2
  • Checkpoint Save Blocking Time = (train loop time) / (checkpoint interval) × (blocking time per checkpoint)
  • The optimal interval minimizes total lost time
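Treating the two formulas above as one cost function of the interval gives a closed-form optimum — a back-of-the-envelope derivation, not Meta's exact tuning procedure:

```python
import math

def lost_time(interval, failures, train_time, block_per_ckpt):
    """Total time lost to checkpointing at a given interval (hours)."""
    unsaved = failures * interval / 2                     # avg work lost per failure
    blocking = (train_time / interval) * block_per_ckpt   # time blocked on saves
    return unsaved + blocking

def optimal_interval(failures, train_time, block_per_ckpt):
    """Setting d/di [F*i/2 + T*b/i] = 0 gives i* = sqrt(2*T*b/F)."""
    return math.sqrt(2 * train_time * block_per_ckpt / failures)

# Example: 4 failures over a 100-hour run, ~72 s (0.02 h) blocked per save.
i_star = optimal_interval(4, 100, 0.02)
print(i_star)  # 1.0 — checkpoint roughly every hour
```

Sanity check: at the optimum, both loss terms are equal (2 hours each here); halving or doubling the interval raises total lost time from 4 to 5 hours.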

Standalone model publishing: 30 minutes less

In the classic flow, training finishes and the same GPU allocation then performs model publishing (export to the production format, validation, upload) before it is released.

Meta separated publishing from training:

  • Training creates an anchor checkpoint
  • A separate CPU-based standalone job publishes the model in parallel

Result: ~30 minutes less per job. For companies running hundreds of training jobs per day, that is hundreds of hours per month.
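The decoupling can be sketched with two concurrent units of work (names are illustrative): training writes its final anchor checkpoint and releases the GPU immediately, while a CPU-only job picks the checkpoint up and publishes it in parallel.

```python
import threading
import time

events = []  # records the order in which things happen

def publish(anchor):
    """CPU-only standalone job: export, validate, upload."""
    time.sleep(0.05)  # stands in for export/validation/upload work
    events.append(("published", anchor["step"]))

# Training ends by writing an anchor checkpoint...
anchor = {"step": 10_000, "weights": "..."}
job = threading.Thread(target=publish, args=(anchor,))
job.start()
# ...and the expensive GPU allocation is released right away,
# without waiting for publishing to complete.
events.append(("gpu_released", anchor["step"]))
job.join()
```

The saving comes entirely from ordering: publishing time no longer counts against GPU wall time, so it disappears from the ETT denominator's idle share.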

Trainer initialization

Communication optimizations:

  • Eliminated unnecessary process group creations
  • Removed unnecessary all_gather calls for metadata
  • Instead, global rank metadata is built locally after the sharding plan broadcast

Pipeline optimizations:

  • Parallelization of independent initialization phases
  • PT2 compilation overlaps with DDP warm-up using “fast batch” data
  • Particularly useful for foundation models with long data loading times
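The pipeline idea — run initialization phases concurrently when they don't depend on each other — can be sketched with a thread pool (the phase names and durations are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def phase(name, seconds):
    """Stands in for one init phase: compile, warm-up, data-loader spin-up."""
    time.sleep(seconds)
    return name

start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    futures = [
        pool.submit(phase, "pt2_compile", 0.2),
        pool.submit(phase, "ddp_warmup", 0.2),
        pool.submit(phase, "dataloader_init", 0.2),
    ]
    done = [f.result() for f in futures]
elapsed = time.perf_counter() - start
# Sequentially these phases would take ~0.6 s; overlapped, startup
# approaches the duration of the slowest single phase (~0.2 s).
```

The "fast batch" trick in the post follows the same logic: feed PT2 compilation a small, quickly available batch so compilation overlaps with DDP warm-up instead of waiting on the full data pipeline.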

Failure reduction

Meta identified two main causes of failures:

  1. Job preemptions (more concurrent jobs = more conflicts)
  2. Service regressions

Their response is two-pronged: collaboration with infra teams on new scheduling algorithms, plus a component-level observability dashboard that displays Time to Start (TTS), Time to Recover (TTR), unsaved training time and checkpoint saving time in real time.

Open-source contributions

PyTorch 2.0 improvements:

  • TORCH_COMPILE_DYNAMIC_SOURCES for dynamic shape handling
  • MegaCache end-to-end caching system
  • PyTorch native staging APIs

TorchRec improvements:

  • Sharding plan optimizations (eliminated all_gather overhead)
  • Communication optimization patterns

All of these are available through PyTorch and TorchRec, with documentation, so other organizations can replicate the approach.

Message for the industry

The deepest lesson from Meta’s post is a paradigm shift in optimization: from “how do we make each training iteration faster” to “how do we reduce everything that isn’t actual training”. While the community focuses on MFU and raising throughput, Meta shows that a 10 percent ETT gain is just as valuable as a 10 percent MFU gain — and often much easier to achieve through engineering.

For organizations scaling AI training, ETT becomes just as important a metric as MFU.

🤖

This article was generated using artificial intelligence from primary sources.