Batch size (`--max-num-seqs` in vLLM) is the single knob with the biggest effect on the throughput-versus-latency trade-off. Here are concrete numbers from our hosted RTX 5060 Ti 16GB to help you pick a value.
Batch Sweep (Llama 3.1 8B FP8 + FP8 KV)
| max-num-seqs | Aggregate t/s | Per-user t/s | p50 TTFT | p99 TTFT |
|---|---|---|---|---|
| 1 | 112 | 112 | 120 ms | 180 ms |
| 4 | 355 | 89 | 160 ms | 310 ms |
| 8 | 510 | 64 | 200 ms | 480 ms |
| 16 | 640 | 40 | 280 ms | 780 ms |
| 32 | 720 | 22 | 420 ms | 1,450 ms |
| 48 | 750 | 16 | 560 ms | 2,100 ms |
| 64 | 760 | 12 | 720 ms | 2,800 ms |
Aggregate throughput is nearly flat past batch 32: diminishing returns as memory bandwidth saturates. Meanwhile per-user throughput keeps dropping and TTFT keeps climbing.
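The saturation point is easy to see if you compute per-user throughput (aggregate ÷ batch) and the marginal aggregate gain at each step. A minimal sketch, using the sweep numbers from the table above:

```python
# Sweep results from the table: max-num-seqs -> aggregate tokens/sec.
sweep = {1: 112, 4: 355, 8: 510, 16: 640, 32: 720, 48: 750, 64: 760}

def marginal_gains(sweep):
    """Return (batch, per-user t/s, % aggregate gain over the previous step)."""
    rows, prev_agg = [], None
    for batch, agg in sorted(sweep.items()):
        gain = None if prev_agg is None else round((agg - prev_agg) / prev_agg * 100, 1)
        rows.append((batch, round(agg / batch, 1), gain))
        prev_agg = agg
    return rows

for batch, per_user, gain in marginal_gains(sweep):
    print(f"batch {batch:>2}: {per_user:5.1f} t/s per user, aggregate gain: {gain}%")
```

Doubling from 16 to 32 buys ~12.5% more aggregate throughput but nearly halves per-user speed; 48 to 64 adds under 2%.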
Interactive Chat Target
- Goal: 30-60 tokens/sec per user (faster than reading speed)
- Recommended: `--max-num-seqs 16` – ~40 t/s per user, 640 t/s aggregate
- TTFT p99 under 800 ms
Bulk API Target
- Goal: maximise completions per minute
- Recommended: `--max-num-seqs 32-48` – peak aggregate throughput
- Accept 1-2 s TTFT p99
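Both targets reduce to the same selection rule: take the largest batch whose per-user throughput still meets your SLA. A small helper, sketched against the sweep numbers above (the function name is ours, not a vLLM API):

```python
# Sweep results from the table: max-num-seqs -> aggregate tokens/sec.
sweep = {1: 112, 4: 355, 8: 510, 16: 640, 32: 720, 48: 750, 64: 760}

def pick_max_num_seqs(per_user_target, sweep):
    """Largest measured batch whose per-user t/s (aggregate / batch) meets the target."""
    best = min(sweep)  # batch 1 always meets any target the card can hit at all
    for batch, agg in sorted(sweep.items()):
        if agg / batch >= per_user_target:
            best = batch
    return best

print(pick_max_num_seqs(30, sweep))  # interactive chat floor -> 16
print(pick_max_num_seqs(60, sweep))  # stricter 60 t/s SLA -> 8
```

For a pure bulk workload the "target" is effectively zero, so you run straight to the 32-48 plateau.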
Recommended Defaults
| Workload | max-num-seqs |
|---|---|
| Interactive chat (SLA) | 16 |
| General purpose (balanced) | 24 |
| Bulk completion API | 32-48 |
| Throughput benchmark | 64+ |
| Low-VRAM model (14B AWQ) | 8 |
vLLM’s default is 256, which is far too high for a 16 GB card and creates KV cache pressure. Always override it.
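In practice that means passing the flag explicitly at launch. A minimal sketch that builds the `vllm serve` command line (the model name is an example; `--max-num-seqs` is the only flag this article tunes):

```python
import shlex

def serve_cmd(model, max_num_seqs, extra=()):
    """Build a `vllm serve` command with an explicit --max-num-seqs override."""
    args = ["vllm", "serve", model, "--max-num-seqs", str(max_num_seqs), *extra]
    return shlex.join(args)

print(serve_cmd("meta-llama/Llama-3.1-8B-Instruct", 16))
# vllm serve meta-llama/Llama-3.1-8B-Instruct --max-num-seqs 16
```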
Tuned Blackwell 16GB Hosting
Right batch for your workload. UK dedicated hosting.
Order the RTX 5060 Ti 16GB

See also: max throughput, concurrent users, TTFT p99, decode benchmark.