
arXiv:2606.25519: Quantization inflates reasoning — the hidden cost of low-bit models



Quantizing language models to INT4/INT3 preserves answer accuracy but lengthens the chain of thought and cancels the expected inference speedup. Microsoft researchers introduced the CoT Token Inflation Ratio metric and tested it on math, code, science, and agentic tasks.


This article was generated using artificial intelligence from primary sources.

What is quantization and why is it used?

Quantization, the process of reducing the bit precision of model weights from 16 or 32 bits to 4 or 3 bits (INT4 or INT3), is a standard technique for speeding up inference and reducing the memory footprint of large language models. Microsoft researchers (7 authors; paper published June 24, 2026) report that this technique carries a hidden cost that previous evaluations did not measure.

What is the real cost of low-bit models?

Quantization to INT4 or INT3 precision preserves the accuracy of the final answer but significantly lengthens the chain of thought (the sequence of intermediate steps a model generates before the final answer). Quantized models produce more intermediate steps and semantic repetitions than their full-precision equivalents, and the larger number of generated tokens completely negates the per-token speedup.
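The trade-off can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a deliberately simple latency model in which response time is the number of generated tokens multiplied by the time per token. The function name and the numbers are hypothetical and are not taken from the paper.

```python
def effective_speedup(per_token_speedup: float, inflation_ratio: float) -> float:
    """End-to-end speedup under the assumed model: latency = tokens * time per token.

    per_token_speedup: how many times faster the quantized model emits one token.
    inflation_ratio:   how many times more tokens the quantized model generates.
    """
    if per_token_speedup <= 0 or inflation_ratio <= 0:
        raise ValueError("both arguments must be positive")
    return per_token_speedup / inflation_ratio


# Hypothetical values, chosen only to show the arithmetic.
print(effective_speedup(1.5, 1.5))  # 1.0  -> the speedup is fully cancelled
print(effective_speedup(1.5, 2.0))  # 0.75 -> the quantized model is slower end to end
```

Under this model, any inflation ratio at or above the per-token speedup removes the latency benefit entirely.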

A new metric: CoT Token Inflation Ratio

The researchers introduced the CoT Token Inflation Ratio, a metric defined as the ratio of the chain-of-thought length of a quantized model to that of the original full-precision model. Testing was conducted across four task categories: mathematical reasoning, code generation, scientific Q&A, and agentic tool use (tasks involving tool calls). In all four categories, quantization increased reasoning token consumption.
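A minimal sketch of how such a ratio could be computed from per-prompt token counts follows. The aggregation used here (total quantized tokens divided by total baseline tokens over the same prompts) is an assumption, since the summary does not state how the paper aggregates across examples. The function name and sample counts are hypothetical.

```python
from typing import Sequence


def cot_token_inflation_ratio(
    quantized_lengths: Sequence[int],
    baseline_lengths: Sequence[int],
) -> float:
    """Ratio of chain-of-thought tokens: quantized model over full-precision model.

    Both sequences hold token counts for the same prompts, in the same order.
    Assumed aggregation: ratio of totals (not the paper's confirmed definition).
    """
    if len(quantized_lengths) != len(baseline_lengths):
        raise ValueError("both models must be evaluated on the same prompts")
    baseline_total = sum(baseline_lengths)
    if baseline_total <= 0:
        raise ValueError("baseline token count must be positive")
    return sum(quantized_lengths) / baseline_total


# Hypothetical token counts for three prompts.
baseline = [400, 600, 1000]   # full-precision model
quantized = [500, 900, 1200]  # low-bit model on the same prompts
print(cot_token_inflation_ratio(quantized, baseline))  # 1.3
```

A value above 1 means the quantized model spends more reasoning tokens than the original; a value of 1 means no inflation.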

Solution: training, not prompting

The authors compared three mitigation approaches: prompting strategies, sampling techniques, and quantization-aware training. They conclude that only quantization-aware training simultaneously reduces both accuracy loss and token inflation. Prompting and sampling mitigations proved insufficient.

The practical implication: evaluations of quantized reasoning models must report token consumption during reasoning alongside accuracy, since these are two separate costs that together determine true efficiency.
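One way to follow this recommendation is to record both quantities for every evaluated example and report them side by side. The sketch below is illustrative only: the `EvalRecord` type, the `summarize` helper, and all sample values are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalRecord:
    correct: bool          # whether the final answer was right
    reasoning_tokens: int  # tokens generated in the chain of thought


def summarize(records: List[EvalRecord]) -> Dict[str, float]:
    """Report accuracy and mean reasoning tokens together for one model."""
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    return {
        "accuracy": sum(r.correct for r in records) / n,
        "mean_reasoning_tokens": sum(r.reasoning_tokens for r in records) / n,
    }


# Hypothetical results for the same four prompts.
full_precision = [EvalRecord(True, 400), EvalRecord(True, 600),
                  EvalRecord(False, 800), EvalRecord(True, 200)]
low_bit = [EvalRecord(True, 700), EvalRecord(True, 900),
           EvalRecord(False, 1000), EvalRecord(True, 400)]

print(summarize(full_precision))  # {'accuracy': 0.75, 'mean_reasoning_tokens': 500.0}
print(summarize(low_bit))         # {'accuracy': 0.75, 'mean_reasoning_tokens': 750.0}
```

In this made-up example the two models look identical on accuracy alone, and only the second column exposes the extra reasoning cost.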

Frequently Asked Questions

Why does quantization lengthen the chain of thought?
Low-bit precision introduces small numerical errors into model weights. The model then generates more intermediate steps and semantic repetitions to compensate for the resulting uncertainty, even when it ultimately reaches the correct answer.
How can token inflation in quantized models be reduced?
Quantization-aware training proved most effective: it outperforms both prompting strategies and sampling techniques in reducing both accuracy loss and token inflation.