🟢 🏥 In Practice Sunday, April 19, 2026 · 2 min read

RACER: Training-Free Method That Doubles LLM Inference Speed by Combining Retrieval and Logits Draft Strategies

Editorial illustration: parallel token streams flowing faster through a verification channel

Why it matters

RACER is a training-free method for accelerating large language models that combines retrieval-based and logits-based drafting strategies for speculative decoding. It achieves more than 2× speedup over autoregressive decoding, outperforms all previous training-free methods, and has been accepted to ACL 2026 Findings. It was evaluated on Spec-Bench, HumanEval, and MGSM-ZH benchmarks.

What Is Speculative Decoding and Why Does It Matter?

Speculative decoding is a technique for accelerating large language models in which a smaller, faster “draft” model proposes several upcoming tokens at once, and the large main model then verifies them in a single forward pass. When the proposals match what the main model would have generated itself, several tokens are accepted per forward pass instead of one, yielding speedup with no quality loss: the verified output is identical to what standard autoregressive decoding would produce.

The catch is that draft quality caps the speedup: if the draft is frequently wrong, verification rejects most of its proposals and the benefit evaporates. Traditional approaches require either training a separate draft model or relying on complex heuristics.
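The draft-and-verify loop can be sketched in a few lines. This is a toy greedy version: `target_next` and `draft_propose` are stand-ins for the main model and the draft source, and the per-token verification checks shown here would happen in one batched forward pass in a real system.

```python
def speculative_decode(target_next, draft_propose, prompt, max_len, k=4):
    """Toy greedy speculative decoding: a cheap draft proposes up to k
    tokens, the target keeps the longest prefix it agrees with, then
    appends one token of its own (the standard correction step)."""
    seq = list(prompt)
    while len(seq) < max_len:
        accepted = []
        for tok in draft_propose(seq, k):
            if len(seq) + len(accepted) >= max_len:
                break
            if target_next(seq + accepted) == tok:
                accepted.append(tok)   # draft token verified, keep it
            else:
                break                  # first mismatch ends the run
        seq += accepted
        if len(seq) < max_len:
            seq.append(target_next(seq))  # target's own correction token
    return seq
```

Because every kept token passed verification, the result matches plain greedy decoding exactly; the win is that one target pass can confirm several tokens at once instead of producing one.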

How Does RACER Work?

RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding) combines two draft strategies that complement each other:

  1. Retrieval-based drafting — for parts of the response that are routine or appear in training data, RACER retrieves similar sequences from a corpus and uses them as the draft. The authors call these “reliable anchors” — for predictable segments, retrieval delivers accurate proposals.

  2. Logits-based drafting — for more creative or less predictable parts, RACER uses the model’s own logit probabilities to generate the draft. The authors call this “flexible extrapolation” — for situations where retrieval is unreliable.

Critically, the entire method works without any additional training — it is applied to an existing model and immediately delivers speedup.
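A minimal way to picture the hybrid is a draft function that first tries a corpus lookup and falls back to a model-driven proposal on a miss. This is an illustrative sketch, not RACER's actual selection rule: the suffix-match retrieval, the key length `n`, and the `logits_draft` fallback are all assumptions made for the example.

```python
def hybrid_draft(seq, k, corpus, logits_draft, n=3):
    """Illustrative hybrid draft step: retrieve a continuation of the
    current n-token suffix from a corpus ("reliable anchor"); if no
    match exists, fall back to a logits-based proposal
    ("flexible extrapolation")."""
    key = "".join(seq[-n:])
    idx = corpus.find(key)
    if idx != -1 and idx + n + k <= len(corpus):
        return list(corpus[idx + n : idx + n + k])  # retrieved continuation
    return logits_draft(seq, k)  # retrieval missed: model proposes instead
```

The design point the paper makes is exactly this division of labor: retrieval handles routine, previously seen continuations cheaply and accurately, while the logits path covers novel text where no good corpus match exists.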

How Much Faster Is It Really?

Across three benchmarks, the results are consistent:

  • Spec-Bench: >2× speedup over autoregressive baseline
  • HumanEval (code generation): >2× speedup
  • MGSM-ZH (math in Chinese): >2× speedup

RACER outperforms all previous training-free speculative decoding methods, including standalone retrieval-based and logits-based approaches. The combination delivers a larger boost because it covers different generation regimes.

What Can Developers Use Right Away?

RACER has been accepted to ACL 2026 Findings, which means the code will very likely be available in the official repository. For engineers running their own LLM inference servers (vLLM, llama.cpp, TensorRT-LLM), a method like this means:

  • 2× faster generation without reconfiguring the model
  • No training costs — no LoRA, RLHF, or additional draft model needed
  • Compatibility with existing quantizations and optimizations

For production LLM workloads (customer support, code assistants, batch inference), a 2× speedup translates directly into half the GPU costs at the same throughput.
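The cost claim is simple arithmetic. With hypothetical numbers (a $2/hour GPU serving 50 tokens/s at baseline, rates chosen only for illustration), doubling decode throughput halves the bill for a fixed token volume:

```python
def gpu_cost_usd(tokens, tokens_per_sec, usd_per_gpu_hour):
    """Back-of-envelope serving cost for a fixed volume of generated tokens."""
    gpu_hours = tokens / tokens_per_sec / 3600
    return gpu_hours * usd_per_gpu_hour

# Hypothetical rates: 1B generated tokens at $2 per GPU-hour
baseline = gpu_cost_usd(1_000_000_000, 50, 2.0)   # autoregressive decoding
racer    = gpu_cost_usd(1_000_000_000, 100, 2.0)  # with a 2x decoding speedup
```

Since cost scales inversely with tokens per second, any real-world speedup below the benchmark 2× shrinks the savings proportionally.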


This article was generated using artificial intelligence from primary sources.