
vLLM Speculative Decoding Setup – Faster Tokens, Same Model

Speculative decoding uses a small draft model to speed up a larger one. Configured correctly on a dedicated GPU, it delivers a 1.5-2x decode speed-up with no change in output quality.

Speculative decoding is one of the few genuine free lunches in LLM serving. A small draft model proposes several tokens; the large target model verifies them all in a single forward pass. When the draft is mostly right, you get several tokens for the cost of one. On dedicated GPU servers the speed-up is typically 1.5-2x, with output quality unchanged.


How It Works

The draft model (e.g. Llama 3.2 1B) proposes k tokens. The target model (e.g. Llama 3.1 70B) runs its forward pass once and evaluates all k proposed tokens simultaneously. Accepted tokens are kept; at the first rejection, the target's own token is used instead and drafting resumes from that point. Net effect: often 1.5-2x more tokens per target-model forward pass.
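The accept/reject step can be sketched in a few lines of Python. This is a toy greedy-decoding version, not vLLM's actual implementation (vLLM uses rejection sampling so the target's sampling distribution is preserved); `draft` holds the k proposed token ids and `target` the k+1 ids the target model emits in its single verification pass.

```python
def accept_tokens(draft, target):
    """Greedy speculative verification (toy sketch).

    draft:  k token ids proposed by the draft model
    target: k+1 token ids from the target's single forward pass --
            its prediction at each draft position, plus one bonus
            token for the position after the last draft token.
    """
    out = []
    for i, d in enumerate(draft):
        if d == target[i]:
            out.append(d)            # draft token accepted
        else:
            out.append(target[i])    # first rejection: keep the target's
            return out               # token and stop; later drafts are void
    out.append(target[len(draft)])   # all k accepted: bonus token for free
    return out

# Two of five draft tokens accepted -> 3 tokens from one target pass
print(accept_tokens([5, 9, 2, 2, 7], [5, 9, 4, 2, 7, 1]))  # [5, 9, 4]
# All five accepted -> 6 tokens from one target pass
print(accept_tokens([5, 9, 2, 2, 7], [5, 9, 2, 2, 7, 1]))  # [5, 9, 2, 2, 7, 1]
```

Either way, each verification pass yields at least one token, so the worst case is ordinary decoding plus the (small) cost of running the draft.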

Model Pairing

The draft and target must share a tokeniser. Good pairings in 2026:

Target          | Good Draft
----------------|-----------------------------------------------------------
Llama 3 70B     | Llama 3.2 1B or 3B
Qwen 2.5 72B    | Qwen 2.5 0.5B or 1.5B
Mistral Large 2 | Mistral 7B (larger, but still much smaller than the target)
Llama 3 8B      | Usually not worth it – the target is already fast
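One way to sanity-check a pairing is to compare the two vocabularies directly. The helper below is an illustrative sketch of that check; in practice you would feed it the dicts returned by transformers' `tokenizer.get_vocab()` for each model.

```python
def vocabs_compatible(target_vocab, draft_vocab):
    """Return (ok, detail) for two token->id vocab dicts.

    Speculative decoding needs draft and target to agree on token ids,
    so we compare the actual mappings, not just the vocabulary sizes.
    """
    if target_vocab == draft_vocab:
        return True, "identical vocabularies"
    mismatched = {tok for tok, idx in draft_vocab.items()
                  if target_vocab.get(tok) != idx}
    return False, f"{len(mismatched)} draft tokens map to different ids"

# Toy vocabs standing in for real tokenizer.get_vocab() output:
ok, why = vocabs_compatible({"a": 0, "b": 1}, {"a": 0, "b": 1})
print(ok, why)   # True identical vocabularies
```

With real models this would look like `vocabs_compatible(AutoTokenizer.from_pretrained(target_id).get_vocab(), AutoTokenizer.from_pretrained(draft_id).get_vocab())`, assuming the `transformers` package and access to both model repos.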

Setup in vLLM

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --use-v2-block-manager

--num-speculative-tokens 5 means the draft proposes 5 tokens per step. Values of 3-7 are typical. Higher values waste draft work when the acceptance rate is low; lower values cap the potential speed-up.
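The trade-off in choosing k can be made concrete with a simple back-of-envelope model (an assumption for illustration, not vLLM's scheduler): if each draft token is accepted independently with probability a, the expected tokens per target pass is 1 + a + a² + … + aᵏ, i.e. the accepted prefix plus the guaranteed correction-or-bonus token.

```python
def expected_tokens_per_step(a, k):
    """Expected tokens generated per target forward pass, assuming each
    draft token is accepted independently with probability a.
    Equals 1 + a + a^2 + ... + a^k: the accepted draft prefix plus the
    guaranteed correction/bonus token."""
    return sum(a ** i for i in range(k + 1))

for a in (0.4, 0.6, 0.8):
    print(a, [round(expected_tokens_per_step(a, k), 2) for k in (3, 5, 7)])
```

At 80% acceptance, going from k=3 to k=5 still pays (roughly 2.95 → 3.69 tokens per pass); at 40% it barely moves (roughly 1.62 → 1.66), so the extra draft work is wasted. That asymmetry is why 3-7 is the usual range.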

Speculative decoding requires VRAM for both models. On a 6000 Pro serving Llama 3 70B in INT4, the 1B draft fits comfortably, with room left for KV cache for both models.
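A rough weight-memory budget makes this concrete. The figures below cover weights only (an assumption for illustration; KV cache, activations and CUDA overhead come on top):

```python
def weight_gb(n_params, bytes_per_param):
    """Approximate weight memory in GB (decimal), ignoring KV cache,
    activations and framework overhead."""
    return n_params * bytes_per_param / 1e9

target = weight_gb(70e9, 0.5)   # 70B at INT4 -> ~0.5 bytes per param
draft  = weight_gb(1e9, 2.0)    # 1B draft at FP16 -> 2 bytes per param
print(target, draft, target + draft)  # 35.0 2.0 37.0
```

Subtract the total from your card's VRAM to see what remains for KV cache across both models.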

Caveats

Three things to know:

  • Acceptance rate matters. On creative generation (diverse outputs), acceptance can be 40-50%, meaning less speed-up. On factual Q&A, acceptance is often 70-80%.
  • At very high batch sizes, speculative decoding can underperform because the target model was already batch-saturated without it.
  • Draft VRAM cost reduces available KV cache. If you were near your VRAM ceiling, you may need to lower max-model-len.
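One caveat that isn't a caveat: speculative decoding is invisible to clients. The server launched above exposes the standard OpenAI-compatible API, so requests look identical to non-speculative serving. A minimal stdlib sketch (the endpoint URL and model name assume the launch command above and a default host/port; adjust for your deployment):

```python
import json
import urllib.request

def build_completion_request(prompt, max_tokens=128,
                             model="meta-llama/Llama-3.1-70B-Instruct",
                             url="http://localhost:8000/v1/completions"):
    """Build an OpenAI-style /v1/completions request for the vLLM server.
    Speculative decoding is purely a server-side setting; the client
    payload is unchanged."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_completion_request("Explain speculative decoding in one sentence.")
# Against a running server: body = json.load(urllib.request.urlopen(req))
print(json.loads(req.data)["model"])  # meta-llama/Llama-3.1-70B-Instruct
```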

Speculative Decoding Preconfigured

We set up draft and target model pairings on UK dedicated hosting, tuned for your workload.

Browse GPU Servers

See continuous batching tuning and prefix caching for other free-lunch wins.


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
