Memory bandwidth is the single most important spec for LLM decode performance. The RTX 5060 Ti 16GB pairs GDDR7 with a 128-bit bus for ~448 GB/s on our dedicated hosting. Here is what that number delivers in practice and where it sits in the lineup.
Contents
- The number
- Why bandwidth dominates
- Decode throughput by model
- Lineup rank
- Practical impact
The Number
448 GB/s theoretical, delivered by GDDR7 at 28 Gbps per pin on a 128-bit bus (28 × 128 / 8 = 448). Practical sustained bandwidth in production AI workloads: roughly 380-420 GB/s depending on access pattern.
The GDDR7 generation uses PAM3 signalling (three amplitude levels, ~1.5 bits per cycle) instead of the NRZ (1 bit per cycle) used in GDDR6. More bits per clock at a similar power envelope, which is part of why the 5060 Ti gets +55% bandwidth over the 4060 Ti at only +15 W TDP.
Why Bandwidth Dominates
LLM decode reads the full weight set per token. For a 7B FP16 model (14 GB weights), the GPU reads 14 GB of memory to emit one token. Theoretical ceiling = bandwidth / weight size:
- 448 / 14 = 32 tokens/sec at FP16 theoretical max
- Practical ~70-80% of ceiling: ~25 t/s
At lower precision the weights shrink and throughput rises linearly:
- INT8 or FP8 (7 GB weights): ~64 t/s theoretical, ~50-55 t/s practical
- INT4 (3.5 GB weights): ~128 t/s theoretical, ~95 t/s practical
Compute TFLOPS rarely matter for single-stream decode: the tensor cores sit idle waiting on memory.
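The ceiling arithmetic above can be sketched in a few lines. This is a minimal model assuming decode is fully memory-bound (the full weight set is re-read once per token); the `efficiency` factor standing in for sustained-vs-theoretical bandwidth is an assumption, not a measured value.

```python
# Memory-bound decode: tokens/sec is bounded by bandwidth / weight bytes.
def decode_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float,
                          efficiency: float = 1.0) -> float:
    """Upper bound on tokens/sec for a decode loop that re-reads all weights per token."""
    return bandwidth_gb_s * efficiency / weights_gb

BW = 448  # RTX 5060 Ti 16GB theoretical bandwidth, GB/s

for label, gb in [("FP16 7B", 14.0), ("FP8 7B", 7.0), ("INT4 7B", 3.5)]:
    ceiling = decode_tokens_per_sec(BW, gb)
    practical = decode_tokens_per_sec(BW, gb, efficiency=0.75)  # assumed ~75% of ceiling
    print(f"{label}: ceiling {ceiling:.0f} t/s, practical ~{practical:.0f} t/s")
```

Halving the precision halves the bytes read per token, which is why throughput scales linearly with quantization in this model.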
Decode Throughput by Model
| Model | Weights | Theoretical t/s | Measured t/s |
|---|---|---|---|
| Phi-3-mini 3.8B BF16 | ~7 GB | ~64 | ~135 (smaller attention overhead) |
| Mistral 7B FP8 | ~7 GB | ~64 | ~110 |
| Llama 3 8B FP8 | ~8 GB | ~56 | ~105 |
| Gemma 2 9B FP8 | ~9 GB | ~50 | ~78 |
| Qwen 2.5 14B AWQ INT4 | ~8 GB | ~56 | ~44 (larger compute cost) |
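The "Theoretical t/s" column is just the 448 GB/s ceiling divided by each model's weight footprint. A minimal sketch reproducing it (weight sizes are the approximate figures from the table, not exact checkpoint sizes):

```python
# Reproduce the theoretical-throughput column: bandwidth / weight bytes.
BW = 448  # GB/s

models = {  # approximate weight footprints from the table above
    "Phi-3-mini 3.8B BF16": 7,
    "Mistral 7B FP8": 7,
    "Llama 3 8B FP8": 8,
    "Gemma 2 9B FP8": 9,
    "Qwen 2.5 14B AWQ INT4": 8,
}

for name, weights_gb in models.items():
    print(f"{name}: ~{BW / weights_gb:.0f} t/s theoretical")
```

Measured numbers deviate in both directions: small models carry proportionally more attention/KV-cache overhead per weight byte, and heavily quantized models pay extra dequantization compute per token.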
Lineup Rank
| Card | Memory | Bandwidth |
|---|---|---|
| RTX 6000 Pro | 96 GB | ~1,800 GB/s |
| RTX 5090 | 32 GB | ~1,792 GB/s |
| RTX 5080 | 16 GB | ~960 GB/s |
| RTX 3090 | 24 GB | ~936 GB/s |
| RX 9070 XT | 16 GB | ~640 GB/s |
| RTX 5060 Ti 16GB | 16 GB | ~448 GB/s |
| RTX 5060 8GB | 8 GB | ~448 GB/s |
| RTX 4060 Ti 16GB | 16 GB | ~288 GB/s |
| RTX 4060 | 8 GB | ~272 GB/s |
448 GB/s places the 5060 Ti 16GB 55% above its direct Ada predecessor and close to the prior generation's 70-class bandwidth (the RTX 4070's 504 GB/s), a tier the 60-class has never reached before.
Practical Impact
For decode-bound chat workloads (the most common production LLM pattern), upgrading from a 4060 Ti to a 5060 Ti delivers roughly 50-60% more tokens per second on the same model with no other changes, tracking the ~55% bandwidth increase. For prefill-heavy workloads (long RAG contexts) compute matters more, so the speed-up is smaller but still positive.
The 5060 Ti's bandwidth is adequate for production serving of 7-14B models. 70B-class models do not fit in the 16 GB frame buffer even at INT4 (~35 GB of weights); they call for a step up to the 5090 (32 GB, 1,792 GB/s), which also brings dramatically higher decode throughput.
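Under the memory-bound model, a first-order decode speed-up estimate for any upgrade is just the bandwidth ratio between the two cards. A sketch using the lineup numbers above (real gains vary with kernel efficiency and model overheads):

```python
# First-order decode speed-up estimate: memory-bound throughput scales
# roughly with memory bandwidth.
bandwidth = {  # GB/s, from the lineup table
    "RTX 4060 Ti 16GB": 288,
    "RTX 5060 Ti 16GB": 448,
    "RTX 5090": 1792,
}

def decode_speedup(src: str, dst: str) -> float:
    """Estimated decode-throughput multiplier when moving src -> dst."""
    return bandwidth[dst] / bandwidth[src]

print(f"4060 Ti -> 5060 Ti: {decode_speedup('RTX 4060 Ti 16GB', 'RTX 5060 Ti 16GB'):.2f}x")
print(f"5060 Ti -> 5090:    {decode_speedup('RTX 5060 Ti 16GB', 'RTX 5090'):.2f}x")
```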
See also: full lineup bandwidth ranking, GDDR7 advantage, decode benchmark.