
vLLM Speculative Decoding Setup – Faster Tokens, Same Model

Speculative decoding uses a small draft model to speed up a larger one. Configured correctly on a dedicated GPU, it delivers a 1.5-2x decode speed-up with no change in output quality.

Speculative decoding is one of the few genuine free lunches in LLM serving. A small draft model proposes several tokens; the large target model verifies them all in a single forward pass. When the draft is mostly right, you get several tokens for the cost of one. On dedicated GPU servers the speed-up is typically 1.5-2x, with output quality unchanged.


How It Works

The draft model (e.g. Llama 3.2 1B) proposes k tokens. The target model (e.g. Llama 3.1 70B) runs its forward pass once and evaluates all k proposed tokens simultaneously. Accepted tokens are kept; at the first rejection, the target's own token is used instead and drafting resumes from that point. Net effect: often 1.5-2x more tokens per target-model forward pass.
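The accept/reject step can be sketched in a few lines of Python. This is a toy greedy-decoding version, not vLLM's actual implementation (vLLM uses rejection sampling so the target's sampling distribution is preserved); `draft` holds the k proposed token ids and `target` the k+1 ids the target model emits in its single verification pass.

```python
def accept_tokens(draft, target):
    """Greedy speculative verification (toy sketch).

    draft:  k token ids proposed by the draft model
    target: k+1 token ids from the target's single forward pass --
            its prediction at each draft position, plus one bonus
            token for the position after the last draft token.
    """
    out = []
    for i, d in enumerate(draft):
        if d == target[i]:
            out.append(d)            # draft token accepted
        else:
            out.append(target[i])    # first rejection: keep the target's
            return out               # token and stop; later drafts are void
    out.append(target[len(draft)])   # all k accepted: bonus token for free
    return out

# Two of five draft tokens accepted -> 3 tokens from one target pass
print(accept_tokens([5, 9, 2, 2, 7], [5, 9, 4, 2, 7, 1]))  # [5, 9, 4]
# All five accepted -> 6 tokens from one target pass
print(accept_tokens([5, 9, 2, 2, 7], [5, 9, 2, 2, 7, 1]))  # [5, 9, 2, 2, 7, 1]
```

Either way, each verification pass yields at least one token, so the worst case is ordinary decoding plus the (small) cost of running the draft.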

Model Pairing

The draft and target must share a tokeniser. Good pairings in 2026:

Target          | Good Draft
----------------|-----------------------------------------------------------
Llama 3 70B     | Llama 3.2 1B or 3B
Qwen 2.5 72B    | Qwen 2.5 0.5B or 1.5B
Mistral Large 2 | Mistral 7B (larger, but still much smaller than the target)
Llama 3 8B      | Usually not worth it – the target is already fast
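One way to sanity-check a pairing is to compare the two vocabularies directly. The helper below is an illustrative sketch of that check; in practice you would feed it the dicts returned by transformers' `tokenizer.get_vocab()` for each model.

```python
def vocabs_compatible(target_vocab, draft_vocab):
    """Return (ok, detail) for two token->id vocab dicts.

    Speculative decoding needs draft and target to agree on token ids,
    so we compare the actual mappings, not just the vocabulary sizes.
    """
    if target_vocab == draft_vocab:
        return True, "identical vocabularies"
    mismatched = {tok for tok, idx in draft_vocab.items()
                  if target_vocab.get(tok) != idx}
    return False, f"{len(mismatched)} draft tokens map to different ids"

# Toy vocabs standing in for real tokenizer.get_vocab() output:
ok, why = vocabs_compatible({"a": 0, "b": 1}, {"a": 0, "b": 1})
print(ok, why)   # True identical vocabularies
```

With real models this would look like `vocabs_compatible(AutoTokenizer.from_pretrained(target_id).get_vocab(), AutoTokenizer.from_pretrained(draft_id).get_vocab())`, assuming the `transformers` package and access to both model repos.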

Setup in vLLM

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --use-v2-block-manager

--num-speculative-tokens 5 means the draft proposes 5 tokens per step. Values of 3-7 are typical. Higher values waste draft work when the acceptance rate is low; lower values cap the potential speed-up.
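The trade-off in choosing k can be made concrete with a simple back-of-envelope model (an assumption for illustration, not vLLM's scheduler): if each draft token is accepted independently with probability a, the expected tokens per target pass is 1 + a + a² + … + aᵏ, i.e. the accepted prefix plus the guaranteed correction-or-bonus token.

```python
def expected_tokens_per_step(a, k):
    """Expected tokens generated per target forward pass, assuming each
    draft token is accepted independently with probability a.
    Equals 1 + a + a^2 + ... + a^k: the accepted draft prefix plus the
    guaranteed correction/bonus token."""
    return sum(a ** i for i in range(k + 1))

for a in (0.4, 0.6, 0.8):
    print(a, [round(expected_tokens_per_step(a, k), 2) for k in (3, 5, 7)])
```

At 80% acceptance, going from k=3 to k=5 still pays (roughly 2.95 → 3.69 tokens per pass); at 40% it barely moves (roughly 1.62 → 1.66), so the extra draft work is wasted. That asymmetry is why 3-7 is the usual range.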

Speculative decoding requires VRAM for both models. On a 6000 Pro serving Llama 3 70B in INT4, the 1B draft fits comfortably, with room left for KV cache for both models.
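A rough weight-memory budget makes this concrete. The figures below cover weights only (an assumption for illustration; KV cache, activations and CUDA overhead come on top):

```python
def weight_gb(n_params, bytes_per_param):
    """Approximate weight memory in GB (decimal), ignoring KV cache,
    activations and framework overhead."""
    return n_params * bytes_per_param / 1e9

target = weight_gb(70e9, 0.5)   # 70B at INT4 -> ~0.5 bytes per param
draft  = weight_gb(1e9, 2.0)    # 1B draft at FP16 -> 2 bytes per param
print(target, draft, target + draft)  # 35.0 2.0 37.0
```

Subtract the total from your card's VRAM to see what remains for KV cache across both models.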

Caveats

Three things to know:

  • Acceptance rate matters. On creative generation (diverse outputs), acceptance can be 40-50%, meaning less speed-up. On factual Q&A, acceptance is often 70-80%.
  • At very high batch sizes, speculative decoding can underperform because the target model was already batch-saturated without it.
  • Draft VRAM cost reduces available KV cache. If you were near your VRAM ceiling, you may need to lower max-model-len.
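One caveat that isn't a caveat: speculative decoding is invisible to clients. The server launched above exposes the standard OpenAI-compatible API, so requests look identical to non-speculative serving. A minimal stdlib sketch (the endpoint URL and model name assume the launch command above and a default host/port; adjust for your deployment):

```python
import json
import urllib.request

def build_completion_request(prompt, max_tokens=128,
                             model="meta-llama/Llama-3.1-70B-Instruct",
                             url="http://localhost:8000/v1/completions"):
    """Build an OpenAI-style /v1/completions request for the vLLM server.
    Speculative decoding is purely a server-side setting; the client
    payload is unchanged."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_completion_request("Explain speculative decoding in one sentence.")
# Against a running server: body = json.load(urllib.request.urlopen(req))
print(json.loads(req.data)["model"])  # meta-llama/Llama-3.1-70B-Instruct
```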

Speculative Decoding Preconfigured

We set up draft and target model pairings on UK dedicated hosting, tuned for your workload.

Browse GPU Servers

See continuous batching tuning and prefix caching for other free-lunch wins.


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
