How many tokens per second can a single RTX 5060 Ti 16GB at our hosting deliver at absolute peak? These are the ceilings with full tuning.
Peak Throughput Config
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 \
--kv-cache-dtype fp8 \
--enable-chunked-prefill \
--max-num-batched-tokens 4096 \
--max-num-seqs 64 \
--max-model-len 8192 \
--gpu-memory-utilization 0.95
Key changes from the latency-tuned config: higher --max-num-seqs, higher --gpu-memory-utilization, and a shorter --max-model-len to reclaim KV-cache budget.
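To sanity-check the aggregate figure on your own card, a rough probe along these lines works. It assumes the server above is listening on localhost:8000 with the `openai` Python client installed; the prompt, concurrency, and token counts are placeholders to adjust.

```python
# Rough aggregate-throughput probe against the server started above.
# Assumes localhost:8000 and the `openai` client; values below are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"
CONCURRENCY = 48      # roughly match --max-num-seqs for a peak-style run
MAX_TOKENS = 256

async def one_request() -> int:
    resp = await client.completions.create(
        model=MODEL,
        prompt="Write a short paragraph about GPU memory bandwidth.",
        max_tokens=MAX_TOKENS,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    print(f"aggregate ≈ {sum(counts) / elapsed:.0f} t/s across {CONCURRENCY} streams")

asyncio.run(main())
```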
Peak Numbers
| Model | Peak aggregate t/s | Batch | p99 decode latency |
|---|---|---|---|
| Phi-3-mini FP8 | 2,050 | 96 | 180 ms |
| Llama 3.2 3B FP8 | 1,300 | 80 | 220 ms |
| Mistral 7B FP8 | 830 | 48 | 310 ms |
| Llama 3.1 8B FP8 | 780 | 48 | 340 ms |
| Gemma 2 9B FP8 | 560 | 32 | 380 ms |
| Qwen 2.5 14B AWQ | 360 | 20 | 520 ms |
Phi-3 peaks above 2,000 t/s aggregate. The around-the-clock ceiling is roughly 170M tokens per day; allow for a realistic duty cycle and one card still clears on the order of 50M tokens per day.
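The daily-volume figure is just throughput times seconds times duty cycle; a quick sketch, where the 30% duty cycle is an assumption standing in for real traffic gaps, not a measured value:

```python
# Daily volume from aggregate throughput; duty cycle is an assumption --
# adjust it to your own load profile.
peak_tps = 2_050          # Phi-3-mini FP8, from the table above
duty_cycle = 0.30

tokens_per_day = peak_tps * 86_400 * duty_cycle
print(f"≈ {tokens_per_day / 1e6:.0f}M tokens/day")   # ≈ 53M at 30% duty cycle
```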
What Limits the Ceiling
- Memory bandwidth (448 GB/s). Decode is bandwidth-bound: every generated token re-reads the full weight set, so pushing batch size higher eventually stops helping once each forward pass already saturates the bus. See the back-of-envelope sketch after this list.
- KV cache capacity. On 16 GB, a high batch leaves only a short per-sequence context once weights and overhead are subtracted.
- Prefill compute. At high concurrency, prefill takes a growing share of the schedule; chunked prefill keeps it from starving decode.
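A back-of-envelope sketch of the first two limits, using Llama 3.1 8B FP8 as the example; the weight size, runtime overhead, and KV-per-token layout are assumptions, not measured values:

```python
# Back-of-envelope ceilings on the 5060 Ti (448 GB/s, 16 GB VRAM).
# Model-specific numbers are assumptions for Llama 3.1 8B in FP8.
bandwidth_gb_s = 448.0
weights_gb = 8.0            # ~8B params at 1 byte each (FP8)

# Decode streams every weight once per step, so a single stream tops out near
# bandwidth / weight bytes; batching shares that read across users until
# KV-cache traffic starts to dominate.
print(f"single-stream ceiling ≈ {bandwidth_gb_s / weights_gb:.0f} t/s")

# KV-cache budget: VRAM minus weights and runtime overhead (overhead assumed).
vram_gb, overhead_gb = 16.0, 1.5
kv_budget_gib = vram_gb - weights_gb - overhead_gb

# FP8 KV per token for an assumed 8B layout: 32 layers x 8 KV heads x 128 dims
# x 2 (K and V) x 1 byte = 64 KiB/token.
kv_per_token_kib = 32 * 8 * 128 * 2 / 1024
tokens_in_cache = kv_budget_gib * 1024**2 / kv_per_token_kib
print(f"KV budget ≈ {tokens_in_cache / 1000:.0f}K tokens "
      f"(≈ {tokens_in_cache / 48:.0f} tokens/seq at batch 48)")
```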
Latency Trade-offs
Peak throughput mode sacrifices per-user experience. At 2,000 t/s aggregate on Phi-3 with a batch of 96, each user sees roughly 20 t/s: livable, but not premium. Decide whether your product tolerates it.
For interactive chat, prefer moderate batch sizes; for a bulk completion API, the peak config is the right choice.
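The same aggregate-versus-per-user arithmetic applied across the whole table, using only the figures above; every model lands in roughly the same 16-21 t/s per-user band at its peak batch:

```python
# Per-user decode speed implied by the peak table: aggregate t/s / batch size.
peaks = {
    "Phi-3-mini FP8":   (2_050, 96),
    "Llama 3.2 3B FP8": (1_300, 80),
    "Mistral 7B FP8":     (830, 48),
    "Llama 3.1 8B FP8":   (780, 48),
    "Gemma 2 9B FP8":     (560, 32),
    "Qwen 2.5 14B AWQ":   (360, 20),
}
for model, (tps, batch) in peaks.items():
    print(f"{model:18s} ≈ {tps / batch:2.0f} t/s per user at batch {batch}")
```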
Peak Throughput on Blackwell 16GB
Up to 2,000 t/s aggregate on small models. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: concurrent users, batch size tuning, tokens/watt, TTFT p99, decode benchmark.