AdaMeZO: Adam LLM optimization without memory overhead

AdaMeZO is a zeroth-order optimizer that combines the advantages of the Adam algorithm with the memory efficiency of the MeZO approach for fine-tuning large language models. It uses only forward passes and needs up to 70% fewer of them than MeZO to reach the same performance level.

Researchers Zhijie Cai, Haolong Chen, and Guangxu Zhu have introduced AdaMeZO, a zeroth-order optimizer that brings the benefits of the popular Adam algorithm to large language model fine-tuning — without requiring gradient moments to be stored in GPU memory.

Why is GPU memory a bottleneck for LLM fine-tuning?

The standard Adam optimizer, routinely used for neural network training, tracks two statistics for each model parameter: the first moment (a running mean of gradients) and the second moment (a running mean of squared gradients). Each of these is as large as the model itself, so for models with billions of parameters the weights plus Adam's state need roughly three times the memory of the weights alone. MeZO, a prior approach that uses only forward passes and never computes true gradients, solves the memory problem but converges more slowly because it lacks per-parameter adaptive learning rates.
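As a rough back-of-the-envelope check of that memory claim, the sketch below counts bytes for the weights and for Adam's two moment buffers. It assumes weights and moments share one precision and ignores activations and gradients; the 7-billion-parameter size and the function name are made up for illustration.

```python
def training_state_bytes(num_params, bytes_per_value=4, moment_buffers=2):
    """Bytes for the weights plus `moment_buffers` parameter-sized optimizer buffers."""
    weights = num_params * bytes_per_value
    optimizer_state = moment_buffers * num_params * bytes_per_value
    return weights + optimizer_state


params = 7_000_000_000  # hypothetical model size, not taken from the article
weights_only = training_state_bytes(params, moment_buffers=0)
with_adam = training_state_bytes(params, moment_buffers=2)

print(weights_only / 1e9, "GB for weights alone")
print(with_adam / 1e9, "GB for weights plus Adam moments")
print(with_adam / weights_only)  # 3.0
```

Under these assumptions the factor of three does not depend on the precision chosen, only on the number of parameter-sized buffers kept alongside the weights.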

How does AdaMeZO combine both approaches?

AdaMeZO estimates Adam’s moments without storing them persistently: it applies random weight perturbations and measures the resulting loss changes to reconstruct adaptive moment behavior on the fly, separately for each optimization step. The result is an optimizer that behaves like Adam, adjusting the learning rate according to the estimated geometry of the loss surface, while keeping the same memory footprint as MeZO.
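The article does not spell out the estimator, so the following is only a hypothetical illustration of how Adam-like moments could be had without persistent buffers; it is not the authors' algorithm. It assumes each gradient estimate has the form "one scalar times a random direction", and that the direction can be regenerated from its seed. In that case a list of (seed, scalar) pairs is enough to rebuild both moments coordinate by coordinate. All function names and the sample history are invented for this sketch.

```python
import math
import random


def streamed_moments(history, n, b1=0.9, b2=0.999):
    """Yield (m_i, v_i) per coordinate, rebuilt from (seed, projected_grad) scalars.

    history is oldest-first; gradient estimate k is projected_grad_k * z_k,
    where z_k is regenerated from seed_k instead of being stored.
    """
    rngs = [random.Random(seed) for seed, _ in history]
    newest = len(history) - 1
    for _ in range(n):
        m = v = 0.0
        for k, (rng, (_, c)) in enumerate(zip(rngs, history)):
            g = c * rng.gauss(0.0, 1.0)  # this coordinate of estimate k
            m += (1 - b1) * b1 ** (newest - k) * g
            v += (1 - b2) * b2 ** (newest - k) * g * g
        yield m, v


def stored_moments(history, n, b1=0.9, b2=0.999):
    """Reference: the usual Adam recursion with two parameter-sized buffers."""
    m, v = [0.0] * n, [0.0] * n
    for seed, c in history:
        rng = random.Random(seed)
        for i in range(n):
            g = c * rng.gauss(0.0, 1.0)
            m[i] = b1 * m[i] + (1 - b1) * g
            v[i] = b2 * v[i] + (1 - b2) * g * g
    return m, v


def adam_like_update(theta, history, lr=1e-2, delta=1e-8):
    # Consume the moments one coordinate at a time; nothing parameter-sized is kept.
    for i, (m, v) in enumerate(streamed_moments(history, len(theta))):
        theta[i] -= lr * m / (math.sqrt(v) + delta)


history = [(0, 0.8), (1, -0.3), (2, 1.5)]  # made-up (seed, projected gradient) pairs
streamed = list(streamed_moments(history, n=4))
ref_m, ref_v = stored_moments(history, n=4)

theta = [0.0] * 4
adam_like_update(theta, history)
```

The streamed values match the stored-buffer recursion up to rounding, but the sketch trades memory for compute: every step regenerates one random direction per history entry, so a practical variant would presumably truncate the history, which makes the moments approximate. Bias correction is omitted to keep the example short.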

What do the trajectory visualizations show?

The authors present optimization trajectory visualizations on various loss surfaces that show how AdaMeZO adapts its steps across flat and curved regions of parameter space, in contrast to MeZO’s more uniform behavior. Quantitatively, AdaMeZO needs up to 70% fewer forward passes than the original MeZO to reach the same performance level.

Frequently Asked Questions

What is a zeroth-order optimizer and what is it used for?

A zeroth-order optimizer estimates gradients solely via forward passes, without computing true backpropagation gradients. This drastically reduces GPU memory requirements because gradients and optimizer states do not need to be stored.
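To make this concrete, here is a minimal sketch of a two-point zeroth-order step in the spirit of MeZO-style methods, not a copy of any released implementation. It perturbs the weights in place along a random direction, evaluates the loss twice, and regenerates the same direction from a seed for the update, so no gradient-sized buffer is allocated. The toy quadratic loss, the hyperparameters, and all names are assumptions chosen for illustration.

```python
import random


def loss(theta):
    # Toy quadratic standing in for a model's forward pass.
    return sum((t - 1.0) ** 2 for t in theta)


def perturb(theta, seed, scale):
    # Regenerate the same Gaussian direction z from the seed and add scale * z in place.
    rng = random.Random(seed)
    for i in range(len(theta)):
        theta[i] += scale * rng.gauss(0.0, 1.0)


def zo_sgd_step(theta, seed, eps=1e-3, lr=0.05):
    perturb(theta, seed, +eps)        # theta + eps * z
    loss_plus = loss(theta)
    perturb(theta, seed, -2 * eps)    # theta - eps * z
    loss_minus = loss(theta)
    perturb(theta, seed, +eps)        # back to the original theta
    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    perturb(theta, seed, -lr * projected_grad)  # theta -= lr * projected_grad * z
    return projected_grad


theta = [0.0, 0.0, 0.0, 0.0]
start = loss(theta)
for step in range(500):
    zo_sgd_step(theta, seed=step)
end = loss(theta)
print(start, end)
```

Each step costs two forward passes and stores only a seed and one scalar; the same step size is applied to every coordinate, which is the limitation an adaptive method aims to remove.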

Why couldn't Adam directly replace SGD in the MeZO approach?

Directly applying Adam to MeZO would roughly triple memory requirements, because Adam tracks first and second gradient moments for every parameter. AdaMeZO bypasses this by estimating the moments on the fly without storing them persistently.

How much more efficient is AdaMeZO compared to MeZO?

AdaMeZO needs up to 70% fewer forward passes than standard MeZO at the same memory footprint, which means faster convergence under identical memory constraints.
