Decode is the steady-state phase where the model generates one token per forward pass. It is memory-bandwidth-bound, so throughput scales roughly inversely with weight size. The numbers below are from the RTX 5060 Ti 16GB (448 GB/s) on our hosting:
Bandwidth Ceiling
Theoretical decode t/s = bandwidth / weight_bytes. For Llama 3 8B:
- FP16 (16 GB weights): 448 / 16 = 28 t/s theoretical max
- FP8 (8 GB weights): 448 / 8 = 56 t/s theoretical max
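A quick restatement of that arithmetic in Python, as a minimal sketch (nothing stack-specific, just the naive formula applied to the weight footprints above):

```python
BANDWIDTH_GBPS = 448.0  # RTX 5060 Ti 16GB memory bandwidth

def naive_decode_tps(weight_gb: float) -> float:
    """Tokens/s if decode did nothing but stream the full weights once per token."""
    return BANDWIDTH_GBPS / weight_gb

for label, gb in [("Llama 3 8B FP16", 16.0), ("Llama 3 8B FP8", 8.0)]:
    print(f"{label}: {naive_decode_tps(gb):.0f} t/s naive ceiling")
# Llama 3 8B FP16: 28 t/s naive ceiling
# Llama 3 8B FP8: 56 t/s naive ceiling
```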
Both figures look low next to what we actually measure. The formula is right as a first-order model but naive: in practice not every weight byte has to come in cold from VRAM for every token (attention over the KV cache and the way layers are streamed change the traffic pattern), and measured decode lands at roughly 2-4x the naive estimate. Real numbers:
Measured Decode (Batch 1, 128 in / 512 out)
| Model | Precision | Weights | t/s |
|---|---|---|---|
| Phi-3-mini | FP8 | 3.8 GB | 285 |
| Llama 3.2 3B | FP8 | 3.1 GB | 260 |
| Mistral 7B | FP8 | 7.2 GB | 122 |
| Llama 3.1 8B | FP8 | 8.0 GB | 112 |
| Llama 3.1 8B | AWQ INT4 | 5.5 GB | 135 |
| Gemma 2 9B | FP8 | 9.5 GB | 98 |
| Qwen 2.5 14B | AWQ INT4 | 9.0 GB | 70 |
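If you want to reproduce a batch-1 figure yourself, a minimal timing harness against any OpenAI-compatible endpoint looks roughly like this. The URL, model id and prompt are placeholders standing in for the 128-in / 512-out setup above, not our exact benchmark script:

```python
import time

import requests  # any OpenAI-compatible server will do

URL = "http://localhost:8000/v1/chat/completions"   # placeholder endpoint
MODEL = "llama-3.1-8b-fp8"                           # placeholder model id
PROMPT = "benchmark " + "word " * 126                # roughly a 128-token prompt

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": PROMPT}],
    "max_tokens": 512,
    "temperature": 0.0,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=300).json()
elapsed = time.perf_counter() - start

# usage.completion_tokens is part of the OpenAI-compatible response schema
out_tokens = resp["usage"]["completion_tokens"]
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.0f} t/s")
# Note: this divides by wall-clock time including prefill, so it slightly
# understates pure decode t/s; subtract time-to-first-token for a cleaner number.
```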
Batch Scaling
Llama 3.1 8B FP8, aggregate decode t/s as the batch size increases:
| Batch | t/s aggregate | Scaling factor |
|---|---|---|
| 1 | 112 | 1.0x |
| 2 | 205 | 1.8x |
| 4 | 355 | 3.2x |
| 8 | 510 | 4.6x |
| 16 | 640 | 5.7x |
| 32 | 720 | 6.4x |
| 64 | 760 | 6.8x |
Throughput scales well up to batch 32; gains past that are marginal.
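Re-expressing the same table per sequence makes the latency trade-off explicit; the sketch below just divides the aggregate figures above by batch size:

```python
# Aggregate figures from the batch-scaling table, re-expressed per sequence.
batch_tps = {1: 112, 2: 205, 4: 355, 8: 510, 16: 640, 32: 720, 64: 760}

for batch, aggregate in batch_tps.items():
    per_seq = aggregate / batch
    scaling = aggregate / batch_tps[1]
    print(f"batch {batch:>2}: {aggregate:>3} t/s aggregate, "
          f"{per_seq:5.1f} t/s per sequence, {scaling:.1f}x")
```

At batch 32 each sequence still decodes at around 22 t/s, so aggregate throughput gains come at the cost of per-request speed.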
Pushing Decode Further
- AWQ / GPTQ INT4: 20-30% faster at batch 1 by cutting weight bytes (8.0 GB → 5.5 GB for Llama 3.1 8B above; see the sketch after this list)
- EXL2 4.0 bpw: similar to AWQ, sometimes slightly faster
- Speculative decoding: 1.5-2x at batch 1 for structured outputs
- Higher batch: amortises weight loads across sequences
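As a sanity check on the INT4 bullet, compare the weight-byte ratio with the observed batch-1 speedup, taken straight from the Llama 3.1 8B rows in the measured table:

```python
# Llama 3.1 8B rows from the measured table above: FP8 vs AWQ INT4 at batch 1.
fp8_gb, fp8_tps = 8.0, 112
awq_gb, awq_tps = 5.5, 135

byte_ratio = fp8_gb / awq_gb   # upper bound if decode were purely weight-streaming bound
observed = awq_tps / fp8_tps   # what the benchmark actually shows

print(f"weight-byte ratio: {byte_ratio:.2f}x")                                      # 1.45x
print(f"observed speedup:  {observed:.2f}x ({(observed - 1) * 100:.0f}% faster)")   # 1.21x (21% faster)
# INT4 buys less than the raw byte ratio: dequant overhead and non-weight
# traffic (KV cache, activations) don't shrink with the weights.
```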
Decode-Optimised LLM Hosting
112 t/s on Llama 3 8B FP8, 285 t/s on Phi-3. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: prefill benchmark, TTFT p99, max throughput, batch size tuning, AWQ guide.