Long-context performance degrades predictably with prompt length: prefill attention cost grows quadratically with the number of prompt tokens. Measured numbers on the RTX 5060 Ti 16GB at our hosting:
Setup
- Llama 3.1 8B FP8 + FP8 KV
- Qwen 2.5 14B AWQ + FP8 KV (up to 16k; exceeds VRAM at 32k)
- Qwen 2.5 7B AWQ + FP8 KV + YaRN (for 128k)
- vLLM 0.6.4, FlashAttention 2.6
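As a reference point, here is a minimal launch sketch for the 128k config using vLLM's Python API. The checkpoint name, rope_scaling key names, and memory fraction are assumptions based on the setup above (Qwen 2.5's native window is 32k, so a YaRN factor of 4 reaches 128k); exact kwargs vary between vLLM versions.

```python
# Sketch: Qwen 2.5 7B AWQ + FP8 KV + YaRN 128k via vLLM's Python API.
# Checkpoint name and rope_scaling key names are assumptions; older vLLM
# versions use "type" instead of "rope_type" in the rope_scaling dict.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",   # assumed AWQ checkpoint
    kv_cache_dtype="fp8",                   # FP8 KV cache halves KV memory
    max_model_len=131072,                   # 128k window
    rope_scaling={                          # YaRN: stretch the native 32k by 4x
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
    gpu_memory_utilization=0.92,            # leave headroom on the 16GB card
)
```

The Llama and Qwen 14B configs drop the rope_scaling block and set max_model_len to whatever the table below shows fitting in VRAM.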
TTFT by Prompt Length
| Prompt length | Llama 3.1 8B FP8 | Qwen 14B AWQ | Qwen 7B (128k) |
|---|---|---|---|
| 8k | 1,250 ms | 3,900 ms | 1,480 ms |
| 16k | 2,700 ms | 8,400 ms | 3,200 ms |
| 32k | 6,100 ms | Exceeds VRAM | 7,400 ms |
| 64k | 14,200 ms | N/A | 17,600 ms |
| 128k | N/A | N/A | 41,000 ms |
TTFT over 10 seconds is a poor UX for interactive chat – use streaming with informational “analysing your document…” text while prefill runs.
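A minimal streaming sketch against vLLM's OpenAI-compatible endpoint, with the holding message printed up front and TTFT logged as the time to the first non-empty delta. The URL, model name, and document path are placeholders.

```python
# Sketch: print a holding message, stream the response, and log TTFT
# (time to first non-empty delta). Endpoint, model, and file are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
long_document = open("report.txt").read()       # placeholder long prompt

print("analysing your document…", flush=True)
start = time.perf_counter()
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",       # whichever model is served
    messages=[{"role": "user", "content": long_document + "\n\nSummarise this."}],
    stream=True,
)
seen_first = False
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    if delta and not seen_first:
        print(f"[TTFT {time.perf_counter() - start:.1f}s]")
        seen_first = True
    print(delta, end="", flush=True)
```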
Decode Speed by Active KV Size
As the KV cache grows, per-token attention cost at decode grows linearly with the active context. Measured on Llama 3.1 8B FP8:
| Active context (tokens) | Decode speed (t/s) |
|---|---|
| 1k | 112 |
| 8k | 95 |
| 32k | 72 |
| 64k | 55 |
Decode holds up reasonably – at 64k context you still get 55 t/s (faster than reading speed).
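To reproduce these decode numbers, one approach is to count streamed tokens after the first one and divide by elapsed time, which excludes prefill from the measurement. This sketch assumes vLLM streams roughly one token per chunk; the endpoint, model name, and document path are placeholders.

```python
# Sketch: estimate decode t/s from a streamed response. The clock starts
# at the first content chunk, so prefill time is excluded; chunks are
# counted as tokens, which holds approximately for vLLM streaming.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def decode_tps(prompt: str, max_tokens: int = 256) -> float:
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # assumed served model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    n, t0 = 0, None
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if t0 is None:
                t0 = time.perf_counter()  # first token: start the decode clock
            else:
                n += 1
    assert t0 is not None, "no tokens streamed"
    return n / (time.perf_counter() - t0)

# Example: feed prompts of 1k/8k/32k/64k tokens to fill each table row.
print(f"{decode_tps(open('report.txt').read()):.0f} t/s")  # placeholder document
```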
Chunked Prefill Impact
With chunked prefill enabled, long-prompt requests no longer block concurrent users: the prefill is spread across multiple scheduler steps. Total prefill time rises by ~15%, but other users’ decode stays smooth.
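A config sketch for enabling this in vLLM; the 2048-token step budget is an assumption to tune per workload, and the checkpoint name is illustrative.

```python
# Sketch: chunked prefill splits a long prompt into max_num_batched_tokens-
# sized slices, interleaving them with other requests' decode steps.
from vllm import LLM

llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",  # assumed FP8 checkpoint
    kv_cache_dtype="fp8",
    max_model_len=65536,
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # per-step token budget shared by prefill and decode
)
```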
Verdict
Long-context on this card is usable up to 32k with reasonable TTFT and throughput. 64k is workable if users will tolerate ~14 s of TTFT. 128k requires the specific Qwen 7B + YaRN config and patient users. For real-world 128k production, move to an RTX 6000 Pro or similar.
See also: 128k context guide, context budget, FP8 KV cache, prefix caching, TTFT p99.