Long-context LLMs on the RTX 5060 Ti 16GB require careful KV cache management. Here is what works on our hosting and where you hit walls.
## KV Cache
Per-sequence KV cache grows linearly with context length. Rough numbers for a 12B-class model:
| Context | FP16 KV | FP8 KV |
|---|---|---|
| 8k | ~1 GB | ~0.5 GB |
| 32k | ~4 GB | ~2 GB |
| 64k | ~8 GB | ~4 GB |
| 128k | ~16 GB | ~8 GB |
At 128k one FP16 KV sequence fills the entire card. FP8 KV halves that to 8 GB per sequence.
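The table's numbers follow from the standard GQA KV-cache formula: 2 (K and V) × layers × KV heads × head dim × bytes per element, per token. A minimal sketch, assuming round-number dimensions for a 12B-class model (32 layers, 8 KV heads, head dim 128 — illustrative, not any specific model's exact config):

```python
def kv_cache_gb(context_len, n_layers=32, n_kv_heads=8,
                head_dim=128, bytes_per_elem=2):
    """Per-sequence KV cache: 2 (K and V) x layers x KV heads x head dim x dtype size."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_len * per_token_bytes / 1024**3

print(kv_cache_gb(131_072))                    # FP16 at 128k -> 16.0 GB
print(kv_cache_gb(131_072, bytes_per_elem=1))  # FP8 at 128k  ->  8.0 GB
```

Dropping to FP8 halves `bytes_per_elem`, which is exactly the halving in the table's right column.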
## Models
- Mistral Nemo 12B AWQ (128k native): weights 7 GB; 128k single-user fits with FP8 KV (7 GB weights + 8 GB KV = 15 GB)
- Llama 3.2 1B: weights 2 GB, 128k easy with room for multi-user
- Phi-3.5-mini (128k): weights 8 GB, 128k works with FP8 KV for 1-2 users
- Qwen 2.5 14B AWQ: 32k native (extendable), 32k practical on this card
- GLM-4 9B-1m: 128k variant, similar math to Mistral Nemo
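Each of the claims above reduces to the same budget check: weights plus per-sequence KV must fit in VRAM. A rough sketch that ignores activation and framework overhead, using the weight figures listed above:

```python
def fits(weights_gb, kv_gb_per_seq, n_seqs=1, vram_gb=16):
    """Rough VRAM check: weights + per-sequence KV cache, ignoring runtime overhead."""
    return weights_gb + kv_gb_per_seq * n_seqs <= vram_gb

print(fits(7, 8))     # Mistral Nemo AWQ, one 128k FP8-KV sequence -> True
print(fits(7, 8, 2))  # two 128k FP8-KV sequences -> False
print(fits(2, 8, 1))  # Llama 3.2 1B leaves plenty of room -> True
```

In practice leave a GB or two of headroom for activations and CUDA overhead, which is why the Mistral Nemo fit at 15 GB is tight.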
## Tuning
```shell
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Nemo-Instruct-2407 \
  --quantization awq \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92
```
Use `--max-num-seqs 1` for single-user 128k. For multi-user, drop `--max-model-len` to 32k or switch to Phi-3.5-mini.
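Once running, the server speaks the OpenAI chat completions API. A minimal client sketch using only the standard library (host and port are vLLM's defaults; adjust if you changed them):

```python
import json
from urllib import request

API_URL = "http://localhost:8000/v1/chat/completions"  # vLLM default host/port

def chat(prompt, max_tokens=512):
    """POST one chat request to the local vLLM server and return the reply text."""
    body = json.dumps({
        "model": "mistralai/Mistral-Nemo-Instruct-2407",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = request.Request(API_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With `--enable-prefix-caching`, repeated requests that share a long document prefix reuse its KV blocks, so follow-up questions over the same 100k-token document avoid re-prefilling it.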
## Context vs Concurrency
On 16 GB the trade-off is sharp:
- 32k context, 4-6 concurrent sequences
- 64k context, 2-3 concurrent
- 128k context, 1 concurrent
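These concurrency figures fall straight out of the KV budget. A sketch assuming Mistral Nemo 12B AWQ weights (7 GB) and the FP8 KV column from the table above (0.5 GB per 8k tokens):

```python
def max_concurrent(context_len, vram_gb=16, weights_gb=7, kv_gb_per_8k=0.5):
    """How many FP8-KV sequences of a given length fit alongside the weights."""
    kv_per_seq_gb = kv_gb_per_8k * context_len / 8192
    return int((vram_gb - weights_gb) // kv_per_seq_gb)

for ctx in (32_768, 65_536, 131_072):
    print(f"{ctx // 1024}k -> {max_concurrent(ctx)} sequences")
# 32k -> 4, 64k -> 2, 128k -> 1
```

This lands at the conservative end of the ranges above; smaller models with lighter weight footprints reach the higher ends.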
For multi-user long-context workloads, step up to the RTX 5090 32GB, where multi-user 128k becomes viable.
## Long-Context Mid-Tier Hosting
128k context where the model fits. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: Mistral Nemo deployment, long-context performance.