🟢 🏥 In Practice Sunday, April 19, 2026 · 2 min read

RACER: Training-Free Method That Doubles LLM Inference Speed by Combining Retrieval and Logits Draft Strategies

Editorial illustration: parallel token streams flowing faster through a verification channel

Why it matters

RACER is a training-free method for accelerating large language models that combines retrieval-based and logits-based drafting strategies for speculative decoding. It achieves more than 2× speedup over autoregressive decoding, outperforms all previous training-free methods, and has been accepted to ACL 2026 Findings. It was evaluated on Spec-Bench, HumanEval, and MGSM-ZH benchmarks.

What Is Speculative Decoding and Why Does It Matter?

Speculative decoding is a technique for accelerating large language models in which a smaller, faster “draft” model proposes several upcoming tokens at once, and the large main model then verifies them in a single forward pass. When the proposals match what the main model would have generated itself, several tokens are accepted per forward pass instead of one, yielding speedup with no quality loss: the verified output is identical to what standard autoregressive decoding would produce.

The catch is that draft quality caps the speedup: if the draft is frequently wrong, verification rejects most of its proposals and the benefit evaporates. Traditional approaches require either training a separate draft model or relying on complex heuristics.
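The draft-and-verify loop can be sketched in a few lines. This is a toy greedy version: `target_next` and `draft_propose` are stand-ins for the main model and the draft source, and the per-token verification checks shown here would happen in one batched forward pass in a real system.

```python
def speculative_decode(target_next, draft_propose, prompt, max_len, k=4):
    """Toy greedy speculative decoding: a cheap draft proposes up to k
    tokens, the target keeps the longest prefix it agrees with, then
    appends one token of its own (the standard correction step)."""
    seq = list(prompt)
    while len(seq) < max_len:
        accepted = []
        for tok in draft_propose(seq, k):
            if len(seq) + len(accepted) >= max_len:
                break
            if target_next(seq + accepted) == tok:
                accepted.append(tok)   # draft token verified, keep it
            else:
                break                  # first mismatch ends the run
        seq += accepted
        if len(seq) < max_len:
            seq.append(target_next(seq))  # target's own correction token
    return seq
```

Because every kept token passed verification, the result matches plain greedy decoding exactly; the win is that one target pass can confirm several tokens at once instead of producing one.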

How Does RACER Work?

RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding) combines two draft strategies that complement each other:

  1. Retrieval-based drafting — for parts of the response that are routine or appear in training data, RACER retrieves similar sequences from a corpus and uses them as the draft. The authors call these “reliable anchors” — for predictable segments, retrieval delivers accurate proposals.

  2. Logits-based drafting — for more creative or less predictable parts, RACER uses the model’s own logit probabilities to generate the draft. The authors call this “flexible extrapolation” — for situations where retrieval is unreliable.

Critically, the entire method works without any additional training — it is applied to an existing model and immediately delivers speedup.
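A minimal way to picture the hybrid is a draft function that first tries a corpus lookup and falls back to a model-driven proposal on a miss. This is an illustrative sketch, not RACER's actual selection rule: the suffix-match retrieval, the key length `n`, and the `logits_draft` fallback are all assumptions made for the example.

```python
def hybrid_draft(seq, k, corpus, logits_draft, n=3):
    """Illustrative hybrid draft step: retrieve a continuation of the
    current n-token suffix from a corpus ("reliable anchor"); if no
    match exists, fall back to a logits-based proposal
    ("flexible extrapolation")."""
    key = "".join(seq[-n:])
    idx = corpus.find(key)
    if idx != -1 and idx + n + k <= len(corpus):
        return list(corpus[idx + n : idx + n + k])  # retrieved continuation
    return logits_draft(seq, k)  # retrieval missed: model proposes instead
```

The design point the paper makes is exactly this division of labor: retrieval handles routine, previously seen continuations cheaply and accurately, while the logits path covers novel text where no good corpus match exists.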

How Much Faster Is It Really?

Across three benchmarks, the results are consistent:

  • Spec-Bench: >2× speedup over autoregressive baseline
  • HumanEval (code generation): >2× speedup
  • MGSM-ZH (math in Chinese): >2× speedup

RACER outperforms all previous training-free speculative decoding methods, including standalone retrieval-based and logits-based approaches. The combination delivers a larger boost because it covers different generation regimes.

What Can Developers Use Right Away?

RACER has been accepted to ACL 2026 Findings, which means the code will very likely be available in the official repository. For engineers running their own LLM inference servers (vLLM, llama.cpp, TensorRT-LLM), a method like this means:

  • 2× faster generation without reconfiguring the model
  • No training costs — no LoRA, RLHF, or additional draft model needed
  • Compatibility with existing quantizations and optimizations

For production LLM workloads (customer support, code assistants, batch inference), a 2× speedup translates directly into half the GPU costs at the same throughput.
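The cost claim is simple arithmetic. With hypothetical numbers (a $2/hour GPU serving 50 tokens/s at baseline, rates chosen only for illustration), doubling decode throughput halves the bill for a fixed token volume:

```python
def gpu_cost_usd(tokens, tokens_per_sec, usd_per_gpu_hour):
    """Back-of-envelope serving cost for a fixed volume of generated tokens."""
    gpu_hours = tokens / tokens_per_sec / 3600
    return gpu_hours * usd_per_gpu_hour

# Hypothetical rates: 1B generated tokens at $2 per GPU-hour
baseline = gpu_cost_usd(1_000_000_000, 50, 2.0)   # autoregressive decoding
racer    = gpu_cost_usd(1_000_000_000, 100, 2.0)  # with a 2x decoding speedup
```

Since cost scales inversely with tokens per second, any real-world speedup below the benchmark 2× shrinks the savings proportionally.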


This article was generated using artificial intelligence from primary sources.