Speculative decoding uses a small draft model to propose tokens and the main model to verify them in parallel, producing higher decode throughput when acceptance rate is high. On the RTX 5060 Ti 16GB at our dedicated GPU hosting, speculative decoding is useful – but the 16 GB VRAM budget is tight, so the draft model has to be small and well-chosen.
How It Works
The draft model generates N candidate tokens cheaply. The main model then runs a single forward pass that verifies all N tokens at once – same latency as generating one token normally. Accepted tokens ship; the first rejected token is replaced by the main model’s output. Net effect: if acceptance is high and the draft is fast, you output multiple tokens per main-model forward pass.
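The accept/reject logic can be sketched in a few lines. This is a simplified greedy variant (exact-match acceptance) rather than the probabilistic rejection sampling real implementations use; `speculative_step` and its inputs are illustrative names, not vLLM API:

```python
def speculative_step(draft_tokens, main_predictions):
    """One greedy speculative-decoding step (sketch, not vLLM's actual code).

    draft_tokens: the N candidate tokens proposed by the draft model.
    main_predictions: the main model's own next-token choice at each of the
    N positions plus one bonus position, all from a single forward pass.
    Returns the tokens emitted this step (always at least one).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        if tok == main_predictions[i]:
            out.append(tok)  # accepted: draft agreed with the main model
        else:
            # First mismatch: emit the main model's token instead and stop.
            out.append(main_predictions[i])
            return out
    # All N accepted: the same forward pass also yields a free bonus token.
    out.append(main_predictions[len(draft_tokens)])
    return out
```

Note the worst case still emits one token per main-model forward pass, which is why speculative decoding never produces fewer tokens than plain decoding, only extra draft-model work.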
Speed-up is bounded by acceptance rate. At 80% acceptance with N=4 draft tokens, expected tokens per main forward ~= 3.4 vs. 1.0 without speculative decoding – a 3.4x theoretical uplift, usually 1.8-2.5x in practice after draft overhead.
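The ~3.4 figure follows from a geometric series, assuming an independent per-token acceptance probability p:

```python
def expected_tokens(p, n):
    # Expected tokens emitted per main-model forward pass with n draft
    # tokens and per-token acceptance probability p (assumed independent):
    #   E = (1 - p**(n + 1)) / (1 - p)
    # The +1 in the exponent is the bonus token when all n drafts are accepted.
    return (1 - p ** (n + 1)) / (1 - p)

print(round(expected_tokens(0.80, 4), 2))  # ~3.36, the ~3.4 figure above
```

The same formula shows diminishing returns from raising N: at 80% acceptance, going from N=4 to N=8 only lifts the expectation from ~3.4 to ~4.3 while nearly doubling wasted draft work on rejection.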
Pairings
Draft model must share tokenizer with the main model and should be ~10-20x smaller. Recommended pairings for 16 GB:
| Main | Draft | Main VRAM (FP8) | Draft VRAM (FP8) | Fits 16 GB? |
|---|---|---|---|---|
| Llama 3.1 8B | Llama 3.2 1B | ~8 GB | ~1.2 GB | Yes, with headroom |
| Llama 3.1 8B | Llama 3.2 3B | ~8 GB | ~3 GB | Yes, tighter |
| Qwen 2.5 14B (AWQ) | Qwen 2.5 0.5B | ~9 GB | ~0.7 GB | Yes |
| Qwen 2.5 14B (AWQ) | Qwen 2.5 1.5B | ~9 GB | ~2 GB | Yes, tight |
| Mistral Nemo 12B | TinyLlama 1.1B | – | – | Skip: tokenizer mismatch |
Llama 3.2 1B paired with Llama 3.1 8B is the most balanced setup for this card – identical tokenizer, low draft VRAM, excellent acceptance on code and chat workloads.
vLLM Configuration
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```
Key knobs:
- `--num-speculative-tokens`: 4-6 is the sweet spot. Higher means more wasted work on rejection.
- `--speculative-draft-tensor-parallel-size`: default 1, fine on a single GPU.
- `--spec-decoding-acceptance-method`: `rejection_sampler` (default) or `typical_acceptance_sampler` (higher throughput, small quality cost).
Measured Speed-Ups
On the 5060 Ti 16GB at batch 1, Llama 3.1 8B main + Llama 3.2 1B draft, N=5:
| Workload | Baseline t/s | Speculative t/s | Uplift | Acceptance |
|---|---|---|---|---|
| Chat (general) | 105 | 190 | 1.81x | 78% |
| Code completion | 105 | 220 | 2.10x | 86% |
| Structured JSON | 105 | 245 | 2.33x | 92% |
| Creative writing | 105 | 150 | 1.43x | 62% |
| Translation | 105 | 175 | 1.67x | 72% |
Structured outputs win biggest because they are highly predictable, so the draft model guesses right more often; creative writing wins least because the token distribution is wider. Most production chat workloads land near 1.8x.
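Comparing the measured uplifts against the theoretical cap from the formula above shows how much the draft model's own cost eats. Workload names and figures are taken from the table; the cap ignores draft overhead entirely:

```python
def theoretical_cap(p, n=5):
    # Upper bound on uplift at acceptance p with n draft tokens,
    # ignoring the cost of running the draft model itself.
    return (1 - p ** (n + 1)) / (1 - p)

# (workload, measured acceptance, measured uplift) from the table above
for workload, p, measured in [
    ("Code completion", 0.86, 2.10),
    ("Creative writing", 0.62, 1.43),
]:
    print(f"{workload}: cap {theoretical_cap(p):.2f}x, measured {measured}x")
```

At 86% acceptance the cap is ~4.3x while the measured figure is 2.1x, so roughly half the theoretical win goes to draft-model overhead on this card; the gap narrows at lower acceptance.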
Pitfalls
- Batching kills the win. Speculative decoding helps single-user/low-batch workloads most. At batch >4 on a small GPU the decode is already compute-bound, so speculative overhead eats the gain.
- Tokenizer mismatch is silent death. Different tokenizers mean near-0% acceptance – the server still runs, but slower than baseline.
- Draft VRAM compresses KV cache. Dropping 2 GB of KV cache to fit a draft model may cost you 10k tokens of context length. Measure the trade.
- Cold-start time roughly doubles. Two model loads mean a longer startup – use fast NVMe storage to minimise this.
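The KV-cache trade in the third pitfall is easy to put numbers on. This back-of-envelope sketch uses Llama 3.1 8B's published dimensions (32 layers, 8 KV heads under GQA, head dim 128); the exact cost depends on KV dtype and your serving stack's allocator:

```python
def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Per-token KV cache size; the factor of 2 is for K and V.
    # bytes_per_elem=2 assumes FP16 KV; pass 1 for FP8 KV cache.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

fp16 = kv_bytes_per_token()                  # 131072 bytes = 128 KiB/token
fp8 = kv_bytes_per_token(bytes_per_elem=1)   # 65536 bytes = 64 KiB/token

# Context tokens displaced by giving 2 GB of VRAM to a draft model:
print(2 * 1024**3 // fp16, 2 * 1024**3 // fp8)
```

With FP16 KV, 2 GB of cache is ~16k tokens of context; with FP8 KV it is ~32k. Either way, a 3B draft model can displace a meaningful slice of your context window, which is another argument for the 1B draft.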
When to Enable
Enable speculative decoding when:
- Batch 1 to 4 serving (interactive chat, coding assistants)
- Structured or code-heavy workloads (high acceptance)
- You have VRAM headroom (16 GB is enough for 8B+1B combos)
Skip it when you are serving high concurrency (vLLM's chunked prefill and prefix caching will matter more), when you need maximum context length, or when you are running a quantised 14B via AWQ with little memory to spare.
Speculative Decoding on Blackwell 16GB
Run Llama 3.1 8B with Llama 3.2 1B draft at ~190 t/s on single-user. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: FP8 Llama deployment, FP8 KV cache, coding assistant use case.