🤖 24 AI
🟡 🏥 In Practice Saturday, April 18, 2026 · 4 min read

PyTorch and Meta: over 90 percent effective training time through 40+ optimizations, MegaCache cuts PT2 compile time by 40 percent

Why it matters

Meta has published how it achieved over 90 percent Effective Training Time (ETT) for offline training of its recommendation models. The method includes more than 40 new optimizations in the PyTorch ecosystem, MegaCache, which cuts PT2 compilation time by 40 percent, standalone model publishing that saves about 30 minutes per job, and async checkpointing. The improvements are open-sourced through PyTorch and TorchRec.

A team of 16 engineers at Meta, along with contributors from the PyTorch project, published a detailed post on April 17, 2026 describing how they achieved over 90 percent Effective Training Time (ETT) for offline training of recommendation models by the end of 2025. The article is joint work by Ruilin Chen, Yuzhen Huang, Hang Qi and others, and contains a list of 40+ concrete optimizations developed in the process.

What ETT is and why it matters

Meta introduces a new metric — Effective Training Time (ETT%) — that measures what percentage of total end-to-end wall time actually goes toward productive training.

The formula is simple:

ETT% = 100% - Idleness% - Failure%

Meta breaks ETT into three sub-metrics:

  1. Time to Start — from hardware allocation to first batch consumption
  2. Time to Recover — how long restart and resume takes after failures
  3. Number of Failures — total number of infra-related interruptions

The reason this matters: the classic metric Model FLOPs Utilization (MFU) only measures efficiency within training, but ignores everything that happens before, between and after it. At scale, “in-between” phases become the dominant cost.
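To make the definition concrete, the metric can be computed directly from measured phase durations. A minimal sketch (the function name and inputs are illustrative, not Meta's internal tooling):

```python
def ett_percent(total_hours: float, idle_hours: float, failed_hours: float) -> float:
    """Effective Training Time as a share of end-to-end wall time.

    Implements ETT% = 100% - Idleness% - Failure%, where the two
    subtracted terms are idle and failure-lost time as percentages
    of total wall time.
    """
    idleness = 100.0 * idle_hours / total_hours
    failure = 100.0 * failed_hours / total_hours
    return 100.0 - idleness - failure


# A 100-hour job that spent 6 hours idle (startup, checkpoint stalls)
# and lost 3 hours of work to failures:
print(ett_percent(100.0, 6.0, 3.0))  # 91.0
```

Note that a job can have high MFU during its compute phases and still score a poor ETT% if startup, recovery and checkpointing eat into wall time.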

MegaCache: 40 percent less compilation

One of the main innovations is MegaCache — a consolidated caching system for the PT2 (PyTorch 2 compiler stack) components:

  • Inductor cache
  • Triton bundler
  • AOT Autograd
  • Dynamo PGO
  • Autotune settings

Meta merged them into a single cache that is built once and reused across subsequent jobs.

Result: ~40 percent reduction in average PT2 compile time by the end of 2025. Additional benefits include reduced demands on remote servers, faster model setup and more reliable startup for retried jobs.
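The core idea is that a first job warms all the component caches and saves them as one portable artifact that later jobs load wholesale. A pure-Python sketch of that bundling pattern (the functions and the bundle format here are hypothetical stand-ins; recent PyTorch releases expose a similar mechanism via `torch.compiler.save_cache_artifacts()` / `load_cache_artifacts()`):

```python
import pickle

def save_cache_artifacts(components: dict) -> bytes:
    """Serialize all warm component caches into a single portable blob."""
    return pickle.dumps(components)

def load_cache_artifacts(blob: bytes) -> dict:
    """Restore every component cache at once in a retried or follow-up job."""
    return pickle.loads(blob)

# Job 1 warms each component cache separately, then bundles them once.
warm_caches = {
    "inductor": b"compiled kernels",
    "triton_bundler": b"bundled triton binaries",
    "aot_autograd": b"joint forward/backward graphs",
    "dynamo_pgo": b"profile-guided dynamic shapes",
    "autotune": b"best kernel configs",
}
blob = save_cache_artifacts(warm_caches)

# Jobs 2..N skip recompilation by restoring the whole bundle.
restored = load_cache_artifacts(blob)
assert restored == warm_caches
```

Consolidation is what makes retries reliable: a retried job either gets all component caches consistently or rebuilds from scratch, instead of hitting a partially warm mix.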

Checkpoint management

Checkpointing is critical for recovery, but usually blocks training. Meta worked on three fronts:

Async checkpointing:

  • Creates a CPU memory copy of the checkpoint
  • The main trainer continues while a background process uploads
  • Reduces GPU idle time
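The async pattern above can be sketched in a few lines of stdlib Python (the structure is the point; a real trainer would stage to pinned CPU memory and upload to remote storage):

```python
import copy
import threading
import time

saved = []  # stands in for remote checkpoint storage

def persist(snapshot, step):
    """Background upload: slow, but off the training critical path."""
    time.sleep(0.05)  # stands in for a slow upload to remote storage
    saved.append((step, snapshot))

state = {"weights": [0.0]}
threads = []
for step in range(3):
    state["weights"][0] += 1.0           # "training" mutates state in place
    snapshot = copy.deepcopy(state)      # blocking part: cheap CPU copy ("staging")
    t = threading.Thread(target=persist, args=(snapshot, step))
    t.start()                            # non-blocking: upload runs in background
    threads.append(t)
for t in threads:
    t.join()
```

The deep copy is essential: without it, the background writer would observe later training steps mutating the same state it is uploading.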

PyTorch native staging:

  • Replaced the custom C++ staging implementation
  • Uses new PyTorch native APIs
  • Trade-off: higher trainer memory for less blocking time

Interval optimization:

  • Unsaved Training Time = (# failures) × (checkpoint interval) / 2
  • Checkpoint Save Blocking Time = (train loop time) / (checkpoint interval) × (blocking time per checkpoint)
  • The optimal interval minimizes total lost time
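Treating the two formulas above as one cost function of the interval gives a closed-form optimum — a back-of-the-envelope derivation, not Meta's exact tuning procedure:

```python
import math

def lost_time(interval, failures, train_time, block_per_ckpt):
    """Total time lost to checkpointing at a given interval (hours)."""
    unsaved = failures * interval / 2                     # avg work lost per failure
    blocking = (train_time / interval) * block_per_ckpt   # time blocked on saves
    return unsaved + blocking

def optimal_interval(failures, train_time, block_per_ckpt):
    """Setting d/di [F*i/2 + T*b/i] = 0 gives i* = sqrt(2*T*b/F)."""
    return math.sqrt(2 * train_time * block_per_ckpt / failures)

# Example: 4 failures over a 100-hour run, ~72 s (0.02 h) blocked per save.
i_star = optimal_interval(4, 100, 0.02)
print(i_star)  # 1.0 — checkpoint roughly every hour
```

Sanity check: at the optimum, both loss terms are equal (2 hours each here); halving or doubling the interval raises total lost time from 4 to 5 hours.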

Standalone model publishing: 30 minutes less

In the classic flow, training finishes and the same GPU allocation then performs model publishing (export to the production format, validation, upload) before it is released.

Meta separated publishing from training:

  • Training creates an anchor checkpoint
  • A separate CPU-based standalone job publishes the model in parallel

Result: ~30 minutes less per job. For companies running hundreds of training jobs per day, that is hundreds of hours per month.
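The decoupling can be sketched with two concurrent units of work (names are illustrative): training writes its final anchor checkpoint and releases the GPU immediately, while a CPU-only job picks the checkpoint up and publishes it in parallel.

```python
import threading
import time

events = []  # records the order in which things happen

def publish(anchor):
    """CPU-only standalone job: export, validate, upload."""
    time.sleep(0.05)  # stands in for export/validation/upload work
    events.append(("published", anchor["step"]))

# Training ends by writing an anchor checkpoint...
anchor = {"step": 10_000, "weights": "..."}
job = threading.Thread(target=publish, args=(anchor,))
job.start()
# ...and the expensive GPU allocation is released right away,
# without waiting for publishing to complete.
events.append(("gpu_released", anchor["step"]))
job.join()
```

The saving comes entirely from ordering: publishing time no longer counts against GPU wall time, so it disappears from the ETT denominator's idle share.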

Trainer initialization

Communication optimizations:

  • Eliminated unnecessary process group creations
  • Removed unnecessary all_gather calls for metadata
  • Instead, global rank metadata is built locally after the sharding plan broadcast

Pipeline optimizations:

  • Parallelization of independent initialization phases
  • PT2 compilation overlaps with DDP warm-up using “fast batch” data
  • Particularly useful for foundation models with long data loading times
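The pipeline idea — run initialization phases concurrently when they don't depend on each other — can be sketched with a thread pool (the phase names and durations are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def phase(name, seconds):
    """Stands in for one init phase: compile, warm-up, data-loader spin-up."""
    time.sleep(seconds)
    return name

start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    futures = [
        pool.submit(phase, "pt2_compile", 0.2),
        pool.submit(phase, "ddp_warmup", 0.2),
        pool.submit(phase, "dataloader_init", 0.2),
    ]
    done = [f.result() for f in futures]
elapsed = time.perf_counter() - start
# Sequentially these phases would take ~0.6 s; overlapped, startup
# approaches the duration of the slowest single phase (~0.2 s).
```

The "fast batch" trick in the post follows the same logic: feed PT2 compilation a small, quickly available batch so compilation overlaps with DDP warm-up instead of waiting on the full data pipeline.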

Failure reduction

Meta identified two main causes of failures:

  1. Job preemptions (more concurrent jobs = more conflicts)
  2. Service regressions

Their response is two-pronged: collaboration with infra teams on new scheduling algorithms, plus a component-level observability dashboard that displays Time to Start (TTS), Time to Recover (TTR), unsaved training time and checkpoint saving time in real time.

Open-source contributions

PyTorch 2.0 improvements:

  • TORCH_COMPILE_DYNAMIC_SOURCES for dynamic shape handling
  • MegaCache end-to-end caching system
  • PyTorch native staging APIs

TorchRec improvements:

  • Sharding plan optimizations (eliminated all_gather overhead)
  • Communication optimization patterns

All of these are available through PyTorch and TorchRec, with documentation, so other organizations can replicate the approach.

Message for the industry

The deepest lesson from Meta’s post is a paradigm shift in optimization: from “how do we make each training iteration faster” to “how do we reduce everything that isn’t actual training”. While the community focuses on MFU and raising throughput, Meta shows that a 10 percent ETT gain is just as valuable as a 10 percent MFU gain — and often much easier to achieve through engineering.

For organizations scaling AI training, ETT becomes just as important a metric as MFU.

🤖

This article was generated using artificial intelligence from primary sources.