vLLM: AutoRound quantization arrives in vLLM-Omni for smaller multimodal models

vLLM has integrated Intel's AutoRound quantization into vLLM-Omni, enabling W4A16 compression of multimodal and diffusion models. The result is checkpoints up to 62 percent smaller, with minimal quality loss and faster generation on Intel XPU and NVIDIA GPUs.

This article was generated using artificial intelligence from primary sources.

The vLLM project, one of the most widely used open-source engines for serving large language models, has announced the integration of Intel’s AutoRound quantization into its multimodal branch vLLM-Omni. The goal is to make large multimodal and diffusion models small enough to fit on a single graphics card, without a noticeable loss of quality.

What does AutoRound bring?

AutoRound is a post-training quantization method: it compresses an already trained model to lower bit precision. Specifically, it enables the W4A16 mode, in which model weights are stored in just 4 bits while activations stay in 16 bits. AutoRound jointly optimizes how values are rounded and clipped through three parameters it learns per tensor, which keeps the quantization error under control.
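The rounding-and-clipping idea can be illustrated with a small Python sketch. It is not AutoRound's actual code: the parameter names `v` (a rounding offset), `alpha`, and `beta` (scale factors on the clipping range) are assumptions based on the method's published description, and the learning step that tunes them is left out. The function only shows how such parameters would change the 4-bit result.

```python
def quant_dequant_w4(weights, v=None, alpha=1.0, beta=1.0, bits=4):
    """Quantize a flat list of floats to `bits`-bit integers and map them back.

    v     : assumed per-weight rounding offsets (AutoRound would learn these)
    alpha : assumed factor applied to the maximum of the clipping range
    beta  : assumed factor applied to the minimum of the clipping range
    """
    if v is None:
        v = [0.0] * len(weights)           # plain round-to-nearest
    qmax = 2 ** bits - 1                   # highest integer code: 15 for 4 bits
    w_max = max(weights) * alpha
    w_min = min(weights) * beta
    scale = (w_max - w_min) / qmax or 1.0  # guard against a constant tensor
    zero_point = round(-w_min / scale)     # integer code that represents 0.0
    restored = []
    for w, offset in zip(weights, v):
        q = round(w / scale + offset) + zero_point
        q = min(max(q, 0), qmax)           # clip to the 4-bit range
        restored.append((q - zero_point) * scale)
    return restored


weights = [-1.0, 0.1, 1.0]
print(quant_dequant_w4(weights))                       # round-to-nearest
print(quant_dequant_w4(weights, v=[0.0, -0.4, 0.0]))   # middle weight nudged down
```

With all offsets at zero this is ordinary round-to-nearest. A learned method would instead search for offsets and range factors that minimize the difference between the layer's original and quantized outputs.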

How much smaller do the models actually get?

The most striking example is the Qwen3-Omni-30B-A3B model, whose checkpoint drops from 66 GB to 25 GB, a reduction of about 62 percent. The practical consequence matters more than the number: the minimum hardware requirement drops from four graphics cards to just one. This puts multimodal models within reach of users with more modest hardware.
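The 62 percent figure follows directly from the two checkpoint sizes, as this short calculation shows. The helper name `reduction_percent` is only illustrative.

```python
def reduction_percent(before_gb, after_gb):
    """Share of the original size that was saved, in percent."""
    return (before_gb - after_gb) / before_gb * 100


saved = reduction_percent(66, 25)
print(f"{saved:.1f}% smaller")            # 62.1% smaller
print(f"{66 / 25:.2f}x compression")      # 2.64x compression
```

For comparison, converting every 16-bit weight to 4 bits would save 75 percent. The reported figure is lower, which would be consistent with quantization metadata and some layers staying at higher precision, although the announcement does not break this down.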

Does quality suffer, and how much faster is it?

The quality loss remained surprisingly small. In text-to-image generation, a deviation of only about 1.3 percent was recorded, and on the OmniBench benchmark the W4A16 version even scored slightly better than the BF16 reference. As for speed, CFG Parallel delivers guided generation that is 1.55 to 1.67 times faster than the sequential BF16 baseline. Support covers Intel XPU (B60) and NVIDIA GPUs.
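To see why running guidance in parallel helps, it is useful to recall that classifier-free guidance (CFG) needs two model passes per step, one with the prompt and one without, which are then blended. The sketch below assumes that CFG Parallel means running those two passes at the same time. `toy_denoiser` is a hypothetical stand-in for the model, and the thread pool stands in for separate devices. It does not reflect vLLM-Omni's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_denoiser(latent, conditioned):
    # Hypothetical stand-in for one forward pass of a diffusion model.
    shift = 1.0 if conditioned else 0.0
    return [0.5 * x + shift for x in latent]


def cfg_combine(uncond, cond, guidance_scale):
    # Standard classifier-free guidance blend of the two predictions.
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


def guided_step_sequential(latent, guidance_scale):
    uncond = toy_denoiser(latent, False)   # first pass
    cond = toy_denoiser(latent, True)      # second pass waits for the first
    return cfg_combine(uncond, cond, guidance_scale)


def guided_step_parallel(latent, guidance_scale, pool):
    # Both passes are submitted at once; results are combined when both finish.
    uncond_future = pool.submit(toy_denoiser, latent, False)
    cond_future = pool.submit(toy_denoiser, latent, True)
    return cfg_combine(uncond_future.result(), cond_future.result(), guidance_scale)


with ThreadPoolExecutor(max_workers=2) as pool:
    latent = [0.2, -0.4, 1.0]
    same = guided_step_parallel(latent, 4.0, pool) == guided_step_sequential(latent, 4.0)
    print("parallel matches sequential:", same)
```

Two passes running side by side would at best halve the time per step. The reported 1.55 to 1.67 times speedup sits below that ideal, as one would expect once communication and synchronization costs are included.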

Frequently Asked Questions

What does W4A16 quantization mean?

Model weights are stored in 4 bits, while activations remain in 16 bits. This drastically reduces the model's size while preserving precision during computation.

How much smaller does the model get?

For Qwen3-Omni-30B-A3B, the checkpoint drops from 66 GB to 25 GB, about 62 percent less space. That is the upper end of the savings reported for large Omni models.