
RTX 5060 Ti 16GB Decode Benchmark

Isolated decode throughput on Blackwell 16GB - memory-bandwidth-bound tokens per second across models and precisions.

Decode is the steady-state phase where the model generates one token per forward pass. Because each token requires streaming the model weights from VRAM, decode is memory-bandwidth-bound and throughput scales inversely with weight size. The numbers below were measured on an RTX 5060 Ti 16GB (448 GB/s memory bandwidth) on our hosting.


Bandwidth Ceiling

Theoretical decode t/s = memory bandwidth (GB/s) / weight size (GB). For Llama 3.1 8B:

  • FP16 (16 GB weights): 448 / 16 = 28 t/s theoretical max
  • FP8 (8 GB weights): 448 / 8 = 56 t/s theoretical max

The formula is a useful first-order guide, but it is naive: at batch 1 we measure roughly 2-4x the naive estimate on this card, so treat it as a way to rank models and precisions against each other rather than as a hard ceiling. Real numbers:
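The naive ceiling above can be computed directly. A minimal sketch (the helper name is ours, not from any library):

```python
# Naive decode ceiling: tokens/s = memory bandwidth / weight bytes streamed
# per token, assuming every weight byte is read exactly once per forward pass.

RTX_5060_TI_BW_GBS = 448  # GB/s, RTX 5060 Ti 16GB

def naive_decode_ceiling(weight_gb: float, bandwidth_gbs: float = RTX_5060_TI_BW_GBS) -> float:
    """Naive upper bound on batch-1 decode t/s for a model of weight_gb GB."""
    return bandwidth_gbs / weight_gb

print(naive_decode_ceiling(16))  # Llama 3.1 8B FP16 -> 28.0
print(naive_decode_ceiling(8))   # Llama 3.1 8B FP8  -> 56.0
```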

Measured Decode (Batch 1, 128 in / 512 out)

Model          Precision   Weights   t/s
Phi-3-mini     FP8         3.8 GB    285
Llama 3.2 3B   FP8         3.1 GB    260
Mistral 7B     FP8         7.2 GB    122
Llama 3.1 8B   FP8         8.0 GB    112
Llama 3.1 8B   AWQ INT4    5.5 GB    135
Gemma 2 9B     FP8         9.5 GB     98
Qwen 2.5 14B   AWQ INT4    9.0 GB     70
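To isolate decode from prefill when measuring, subtract time-to-first-token from the total request time before dividing by the generated token count. A minimal sketch (the function name and example timings are ours; the timings are chosen to land near the Llama 3.1 8B FP8 row above):

```python
# Steady-state decode rate: tokens after the first, over post-TTFT time.
# This excludes prefill, matching the 128-in / 512-out batch-1 methodology.

def decode_tps(output_tokens: int, total_s: float, ttft_s: float) -> float:
    """Decode-only tokens per second from raw request timings."""
    return (output_tokens - 1) / (total_s - ttft_s)

# e.g. 512 output tokens, 4.75 s total, 0.19 s to first token:
print(round(decode_tps(512, 4.75, 0.19)))  # -> 112
```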

Batch Scaling

Llama 3.1 8B FP8, aggregate decode t/s as batch size increases:

Batch   t/s aggregate   Scaling factor
1       112             1.0x
2       205             1.8x
4       355             3.2x
8       510             4.6x
16      640             5.7x
32      720             6.4x
64      760             6.8x

Throughput scales well up to batch 32; gains beyond that are marginal.
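The scaling column is just aggregate t/s divided by the batch-1 rate. This sketch reproduces it and adds the per-sequence rate, which is often the more useful number for latency planning (the dict holds the measured values from the table above):

```python
# Aggregate throughput vs batch size for Llama 3.1 8B FP8 (measured values).
BATCH_TPS = {1: 112, 2: 205, 4: 355, 8: 510, 16: 640, 32: 720, 64: 760}

for batch, tps in BATCH_TPS.items():
    scaling = tps / BATCH_TPS[1]   # speedup over batch 1
    per_seq = tps / batch          # what each individual request sees
    print(f"batch {batch:>2}: {scaling:.1f}x aggregate, {per_seq:.0f} t/s per sequence")
```

Note how per-sequence throughput falls from 112 t/s at batch 1 to about 12 t/s at batch 64: aggregate gains come at the cost of individual request speed.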

Pushing Decode Further

  • AWQ / GPTQ INT4: 20-30% faster at batch 1 (halves weight bytes)
  • EXL2 4.0 bpw: similar to AWQ, sometimes slightly faster
  • Speculative decoding: 1.5-2x at batch 1 for structured outputs
  • Higher batch: amortises weight loads across sequences
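For speculative decoding, the claimed 1.5-2x at batch 1 follows from a simple model: if the target model accepts each drafted token independently with probability a, a draft of k tokens yields (1 - a^(k+1)) / (1 - a) accepted tokens per target-model pass. A sketch (the i.i.d.-acceptance assumption and the 70% figure are illustrative, and the model ignores draft-model cost, which eats into the gain in practice):

```python
# Expected tokens accepted per target-model forward pass when drafting
# k tokens with per-token acceptance probability a (geometric series).

def expected_tokens_per_pass(a: float, k: int) -> float:
    """Idealised speculative-decoding yield; real speedup is lower."""
    return (1 - a ** (k + 1)) / (1 - a)

# e.g. 70% acceptance, 4 drafted tokens per step:
print(round(expected_tokens_per_pass(0.7, 4), 2))  # -> 2.77
```

Structured outputs (JSON, code) tend to have higher acceptance rates than free-form prose, which is why the speedup range above is quoted for them.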

Decode-Optimised LLM Hosting

112 t/s on Llama 3.1 8B FP8, 285 t/s on Phi-3-mini. UK dedicated hosting.

Order the RTX 5060 Ti 16GB

See also: prefill benchmark, TTFT p99, max throughput, batch size tuning, AWQ guide.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
