Speculative decoding is one of the few genuine free lunches in LLM serving. A small draft model proposes several tokens; the large target model verifies them in one forward pass. If the draft was mostly right, you generated several tokens for the cost of one. Output quality is unchanged, and on dedicated GPU servers the speed-up is typically 1.5-2x.
## How It Works
The draft model (e.g. Llama 3.2 1B) proposes k tokens. The target model (e.g. Llama 3.1 70B) then runs a single forward pass that scores all k proposed tokens at once. Tokens are accepted left to right until the first disagreement; at that point the target's own token is emitted instead and drafting resumes from there, so every step yields at least one token. Net effect: often 1.5-2x more tokens per target-model forward pass.
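The propose-and-verify loop can be sketched in a few lines. This is a toy greedy-verification variant: `draft_next` and `target_argmax` are hypothetical stand-ins for real model calls, and a real implementation scores all k positions in one batched target pass rather than a Python loop.

```python
def speculative_step(prefix, draft_next, target_argmax, k=5):
    """One speculative step: the draft proposes k tokens, the target verifies.

    Toy greedy variant: accept while the draft token matches the target's
    argmax; on the first mismatch, emit the target's token instead.
    """
    # Draft phase: k cheap sequential calls to the small model.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # Verify phase: conceptually ONE target forward pass over all k positions.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        if target_argmax(ctx) == t:      # draft agreed with target: keep it
            accepted.append(t)
            ctx.append(t)
        else:                            # first mismatch: target's token wins
            accepted.append(target_argmax(ctx))
            break
    else:
        # All k accepted: the target pass also yields one bonus token.
        accepted.append(target_argmax(ctx))
    return accepted
```

Note the two guarantees the real algorithm shares: every step emits at least one token, and a perfect draft yields k+1 tokens per target pass.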
## Model Pairing
The draft and target must share a tokeniser. Good pairings in 2026:
| Target | Good Draft |
|---|---|
| Llama 3 70B | Llama 3.2 1B or 3B |
| Qwen 2.5 72B | Qwen 2.5 0.5B or 1.5B |
| Mistral Large 2 | Mistral 7B (larger but still much smaller than target) |
| Llama 3 8B | Usually not worth it – target is already fast |
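The shared-tokeniser requirement is strict because verification compares token IDs, not strings: the same text must map to the same IDs in both models. A minimal sketch of a compatibility check over two vocabulary mappings (`vocab_mismatches` is a hypothetical helper, shown here on toy dicts rather than real tokeniser files):

```python
def vocab_mismatches(target_vocab, draft_vocab, limit=5):
    """Return up to `limit` (token, target_id, draft_id) conflicts.

    Speculative verification compares raw token IDs, so a draft whose
    tokeniser assigns different IDs (or lacks tokens) cannot be paired
    with the target.
    """
    conflicts = []
    for tok, tid in target_vocab.items():
        did = draft_vocab.get(tok)  # None if the draft lacks the token
        if did != tid:
            conflicts.append((tok, tid, did))
            if len(conflicts) >= limit:
                break
    return conflicts
```

Models from the same family (Llama 3.x, Qwen 2.5) pass this check by construction, which is why the pairings above all stay within one family.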
## Setup in vLLM
```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --use-v2-block-manager
```
`--num-speculative-tokens 5` means the draft proposes 5 tokens per step; 3-7 is typical. Higher values waste draft work when the acceptance rate is low; lower values cap the achievable speed-up.
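The trade-off can be made concrete with a simplified model from the speculative decoding literature: if each drafted token is accepted independently with probability alpha, the expected tokens per target pass is (1 - alpha^(k+1)) / (1 - alpha). A sketch under that assumption:

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per target forward pass, assuming each of the
    k drafted tokens is accepted i.i.d. with probability alpha (simplified
    model; includes the bonus token when everything is accepted)."""
    if alpha == 1.0:
        return k + 1.0  # geometric-series limit: every draft token accepted
    return (1 - alpha ** (k + 1)) / (1 - alpha)
```

At alpha = 0.75, going from k = 3 to k = 5 lifts the expectation from about 2.73 to 3.29 tokens per pass, but k = 7 only adds another 0.31: diminishing returns are why 3-7 is the usual range.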
Requires VRAM for both models. On a 6000 Pro serving Llama 3 70B INT4, the 1B draft fits comfortably with room for KV cache on both.
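Back-of-envelope weight footprints show why the pairing fits (a rough sketch: weights only, ignoring runtime overhead, activations, and KV cache, and assuming the draft runs at FP16):

```python
def weights_gib(n_params, bits_per_param):
    """Approximate weight footprint in GiB: parameters x bits, in bytes,
    converted to GiB. Weights only; KV cache and overhead come on top."""
    return n_params * bits_per_param / 8 / 2**30

target = weights_gib(70e9, 4)   # 70B at INT4: about 32.6 GiB
draft = weights_gib(1e9, 16)    # 1B at FP16: about 1.9 GiB
```

The draft adds under 2 GiB on top of roughly 33 GiB of target weights, leaving the remainder of the card for KV cache on both models.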
## Caveats
Three things to know:
- Acceptance rate matters. On creative generation (diverse outputs), acceptance can be 40-50%, meaning less speed-up. On factual Q&A, acceptance is often 70-80%.
- At very high batch sizes, speculative decoding can underperform because the target model was already batch-saturated without it.
- Draft VRAM cost reduces available KV cache. If you were near your VRAM ceiling, you may need to lower `--max-model-len`.
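The first caveat can be quantified by extending the i.i.d. acceptance model to include the draft's own cost. A rough sketch, assuming each draft forward pass costs a fixed fraction `draft_cost` of a target pass (a made-up knob for illustration; real overheads also include verification and scheduling):

```python
def speedup(alpha, k, draft_cost=0.05):
    """Rough end-to-end speed-up: expected tokens per step divided by the
    step's cost (one target pass plus k draft passes at relative cost
    draft_cost). Assumes i.i.d. per-token acceptance probability alpha."""
    tokens = (k + 1.0) if alpha == 1.0 else (1 - alpha ** (k + 1)) / (1 - alpha)
    return tokens / (1 + k * draft_cost)
```

With k = 5, acceptance of 0.75 (factual Q&A) gives roughly 2.6x, while 0.45 (creative generation) gives only about 1.4x: the same configuration, very different pay-off depending on workload.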
## Speculative Decoding Preconfigured
We set up draft and target model pairings on UK dedicated hosting, tuned for your workload.
Browse GPU Servers

See continuous batching tuning and prefix caching for other free-lunch wins.