arXiv:2606.23181: DART — training-free adaptive thinking in hybrid reasoning models
DART is a training-free routing method that decides whether an AI model needs to think deeply or can respond immediately, reducing thinking token consumption by 15–69% while improving accuracy by up to 22.5 points on code benchmarks.
This article was generated using artificial intelligence from primary sources.
Hybrid reasoning models and the token waste problem
Modern hybrid reasoning models — such as Claude 3.7 Sonnet or QwQ — can choose between two operating modes: a short direct response or a long reasoning chain with so-called thinking tokens (intermediate reasoning steps visible only to the model). The problem is that models often spend costly thinking tokens even on trivial questions, unnecessarily slowing down inference and increasing costs.
Researchers from Korea University and affiliated institutions have introduced DART (Draft-Agreement Routing for Training-Free Adaptive Thinking Budgets) — a method that changes this without a single additional training step.
How does DART decide whether the model needs to think?
The idea is simple: DART first generates two cheap “no-think” drafts (short responses without extended reasoning). If the drafts agree, the model returns the response directly. If they disagree, DART measures the entropy of the disagreement and dynamically computes how large a thinking budget (the maximum number of tokens for deeper reasoning) is needed: greater disagreement means a larger budget.
This approach needs no labeled data and no gradient updates, which makes it applicable to models ranging from 0.6B to 32B parameters, including models that are reachable only through an API and whose internal architecture is not visible.
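The routing rule described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the article does not say how the entropy is computed or how it maps to a budget, so the answer-level Shannon entropy, the linear mapping, and the names `normalized_entropy`, `route`, `min_budget`, and `max_budget` are all hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(answers):
    """Shannon entropy of the empirical answer distribution, scaled to [0, 1].

    Assumption: disagreement is measured over final answers. The paper may
    use a finer-grained (for example token-level) measure.
    """
    counts = Counter(answers)
    n = len(answers)
    if len(counts) <= 1:
        return 0.0  # all drafts agree
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(n)  # maximum entropy is reached when all n answers differ


def route(draft_answers, min_budget=256, max_budget=4096):
    """Return (answer, thinking_budget); a budget of 0 means 'answer directly'.

    min_budget and max_budget are illustrative values, not taken from the paper.
    """
    if len(draft_answers) < 2:
        raise ValueError("need at least two drafts to measure agreement")
    h = normalized_entropy(draft_answers)
    if h == 0.0:
        # Drafts agree: skip extended reasoning and return the shared answer.
        return draft_answers[0], 0
    # Drafts disagree: scale the budget linearly with the disagreement (assumed).
    budget = round(min_budget + h * (max_budget - min_budget))
    return None, budget


if __name__ == "__main__":
    print(route(["42", "42"]))            # agreement: direct answer, no budget
    print(route(["a", "a", "b", "c"]))    # partial disagreement: mid-sized budget
```

With exactly two drafts, this answer-level entropy is binary (the drafts either match or they do not), so a graded budget requires either more drafts or a finer disagreement signal. The sketch accepts any number of drafts for that reason.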
Results: fewer tokens, higher accuracy
Experimental results favor the method. On olympiad-level math benchmarks, DART gains up to 9.0 accuracy points while reducing thinking tokens by 15 to 69% compared to baseline models that always use the full thinking budget.
On code-writing tasks the gain is even more pronounced: up to 22.5 accuracy points with a token consumption reduction of 51 to 63%. Compared to a fixed thinking budget, the standard approach in which the model always spends the same number of tokens regardless of task difficulty, DART offers a better accuracy-to-cost ratio across all tested scenarios.
Why does this matter for production systems?
Thinking tokens are not free: in API models they are billed per token and directly affect latency. DART opens a path toward inference systems that spend expensive resources only when the difficulty of the query justifies it — without fine-tuning or a new model. The code is publicly available, and the method is model-agnostic, meaning it can be applied to various hybrid reasoning systems without modifying the models themselves.
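To make the billing argument concrete, the following sketch converts the reported 15 to 69% reduction in thinking tokens into a cost range. The workload size and the price per million tokens are hypothetical placeholders, not figures from the paper or from any provider, and the calculation ignores the cost of generating the two drafts, which the article does not quantify.

```python
def thinking_cost(thinking_tokens, price_per_million):
    """Cost of a number of thinking tokens at a per-million-token price."""
    return thinking_tokens / 1_000_000 * price_per_million


def cost_after_reduction(baseline_tokens, reduction_pct, price_per_million):
    """Cost after cutting thinking tokens by reduction_pct percent."""
    remaining_tokens = baseline_tokens * (100 - reduction_pct) / 100
    return thinking_cost(remaining_tokens, price_per_million)


if __name__ == "__main__":
    baseline_tokens = 10_000_000  # hypothetical monthly thinking-token volume
    price = 10.0                  # hypothetical price per million tokens

    baseline = thinking_cost(baseline_tokens, price)
    for pct in (15, 69):          # reduction range reported in the article
        reduced = cost_after_reduction(baseline_tokens, pct, price)
        print(f"{pct}% fewer thinking tokens: {reduced:.2f} vs {baseline:.2f} baseline")
```

Because billing is linear in token count, the percentage saved on thinking tokens carries over directly to the thinking-token portion of the bill. Latency effects depend on the serving setup and are not modeled here.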
Frequently Asked Questions
- Does DART require additional training or labeled data?
- No — DART is a training-free method that works purely based on agreement between two cheap drafts, without gradient updates, labeled examples, or access to internal model weights.
- On which models and sizes does DART work?
- DART has been tested on models ranging from 0.6B to 32B parameters across different model families, and works even in API-only settings without access to internal architecture.