TTFT (time to first token) is the latency a user sees before your chat bubble starts streaming. p99 matters more than p50 because tail-latency spikes, not the median, are what drive complaints. The numbers below were measured on the RTX 5060 Ti 16GB at our hosting:
## Baseline, Batch 1 (Llama 3.1 8B FP8)
| Prompt length | p50 TTFT | p99 TTFT |
|---|---|---|
| 128 tok | 110 ms | 160 ms |
| 512 tok | 180 ms | 230 ms |
| 2,048 tok | 400 ms | 490 ms |
| 8,192 tok | 1,350 ms | 1,620 ms |
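p50 and p99 here are percentiles of the measured TTFT distribution. A minimal nearest-rank sketch for computing them from your own per-request samples (the `percentile` helper is ours, not a vLLM API):

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    s = sorted(samples)
    idx = max(0, math.ceil(q / 100 * len(s)) - 1)
    return s[idx]

# Example: ten TTFT samples in milliseconds, one of them a tail spike.
ttfts = [110, 120, 115, 160, 112, 118, 111, 114, 400, 113]
p50 = percentile(ttfts, 50)  # 114 — the median hides the spike
p99 = percentile(ttfts, 99)  # 400 — the tail exposes it
```

This is why dashboards that only chart p50 look healthy while users complain: a single slow request barely moves the median but dominates the p99.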
## Under Concurrent Load (8 users, mixed prompts)
| Config | p50 TTFT | p99 TTFT |
|---|---|---|
| No optimisations | 420 ms | 3,800 ms |
| + chunked prefill | 450 ms | 520 ms |
| + prefix caching | 80 ms | 180 ms |
| + both | 75 ms | 160 ms |
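Measuring TTFT client-side is just timing until the first streamed chunk arrives. A hedged sketch that works against any iterator of response chunks (the actual streaming client is assumed; vLLM's OpenAI-compatible endpoint in stream mode fits this shape):

```python
import time

def measure_ttft(stream):
    """Time from now until `stream` yields its first chunk.

    `stream` can be any iterator of response chunks, e.g. an
    OpenAI-compatible streaming response. Returns (ttft_seconds, first_chunk).
    """
    start = time.perf_counter()
    first = next(stream)  # blocks until the server emits the first token
    return time.perf_counter() - start, first
```

Aggregate these per-request values over a rolling window; the p99 of that window, not the average, is what should trigger alerts.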
The difference between a bad deployment and a tuned one is an order of magnitude in p99.
## Tail Latency Fixes
- Enable chunked prefill. Eliminates the classic “one long prompt blocks everyone” spike.
- Enable prefix caching. Dramatic p50 and p99 improvement for repeated prefixes.
- Lower `--max-num-seqs`. Fewer concurrent sequences means shorter queues.
- Cap prompt length at the application layer. Truncate anything over 8k tokens unless needed.
- Monitor. Export vLLM metrics to Prometheus, alert on p99 > 1 s.
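A hedged sketch of the application-layer cap from the list above; the 8,192 limit mirrors the list, but the helper name and tail-keeping policy are our assumptions (operate on token IDs from whatever tokenizer your serving stack uses):

```python
def cap_prompt(token_ids, max_len=8192, keep_tail=True):
    """Truncate a tokenized prompt to at most max_len tokens.

    Keeping the tail preserves the most recent conversation turns, which
    usually matter more than stale context; flip keep_tail when the head
    (e.g. a system prompt) must survive instead.
    """
    if len(token_ids) <= max_len:
        return token_ids
    return token_ids[-max_len:] if keep_tail else token_ids[:max_len]
```

Note that truncating from the tail versus the head interacts with prefix caching: dropping the oldest tokens changes the prefix, so cached prefixes may stop matching. Capping before the cache-relevant prefix grows past the limit avoids that.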
With these fixes in place, single-card p99 TTFT under 200 ms at 8 concurrent users is reliably achievable.
## Low-Tail-Latency LLM Hosting
p99 TTFT under 200 ms when tuned. UK dedicated hosting.
Order the RTX 5060 Ti 16GB

See also: prefill benchmark, decode benchmark, batch tuning, concurrency.