Mistral Nemo 12B offers 128k context natively – tempting for long-document workloads. On our hosted RTX 5060 Ti 16GB the quantized model fits comfortably, but long context requires careful KV cache management.
Fit
| Precision | Weights |
|---|---|
| FP16 | ~24 GB – does not fit |
| FP8 | ~12 GB – a tight fit |
| AWQ INT4 | ~7 GB – comfortable |
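The table above is back-of-envelope arithmetic: weight memory is roughly parameter count × bytes per parameter, plus some overhead for non-quantized layers and buffers. A minimal sketch; the ~12.2B parameter count and the byte widths per precision are approximations, not exact footprints:

```python
# Rough weight-memory estimate: params * bytes-per-param.
# The parameter count (~12.2B) and per-precision byte widths are
# approximations; real footprints include extra buffers and
# non-quantized layers (which is why AWQ lands nearer ~7 GB).
PARAMS = 12.2e9

def weight_gb(bytes_per_param: float) -> float:
    """Approximate weight footprint in GB for a given precision."""
    return PARAMS * bytes_per_param / 1e9

for name, bpp in [("FP16", 2.0), ("FP8", 1.0), ("AWQ INT4", 0.5)]:
    print(f"{name}: ~{weight_gb(bpp):.0f} GB")
```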
KV Cache at Long Context
Per-sequence KV cache scales linearly with context length. For Nemo 12B:
| Context | KV per seq (FP16) | KV per seq (FP8) |
|---|---|---|
| 8k | ~1 GB | ~0.5 GB |
| 32k | ~4 GB | ~2 GB |
| 64k | ~8 GB | ~4 GB |
| 128k | ~16 GB | ~8 GB |
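The per-sequence figures follow from the standard KV-cache formula: 2 (K and V) × layers × KV heads × head dim × bytes per element × tokens. A sketch using illustrative GQA dimensions for a Nemo-class model – the layer count, KV head count, and head dim here are assumptions (check the model's config.json for exact values), so the output may differ somewhat from the rounded table above:

```python
def kv_cache_gb(tokens: int, bytes_per_elem: int,
                layers: int = 40, kv_heads: int = 8,
                head_dim: int = 128) -> float:
    """Per-sequence KV cache size in GB: K and V tensors across all layers.
    Default dimensions are assumed, illustrative GQA geometry."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

for ctx in (8_192, 32_768, 65_536, 131_072):
    print(f"{ctx // 1024}k: FP16 ~{kv_cache_gb(ctx, 2):.1f} GB, "
          f"FP8 ~{kv_cache_gb(ctx, 1):.1f} GB")
```

The formula makes the key point obvious: KV cost is linear in context length, so every doubling of context halves the number of sequences a fixed budget can hold.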
At 128k, a single sequence of FP16 KV cache alone fills the entire 16 GB card, leaving no room for weights. FP8 KV halves that to 8 GB per sequence – which, alongside ~7 GB of AWQ weights, still allows only one concurrent 128k sequence.
Deployment
For practical multi-user serving, cap context at 32k:

```shell
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-Nemo-Instruct-2407 \
    --quantization awq \
    --max-model-len 32768 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.92
```
For long-context single-user workloads, raise the limit and serve one sequence at a time:

```shell
--max-model-len 131072 --max-num-seqs 1 --kv-cache-dtype fp8
```
Single vs Multi-User
| Mode | Config | Concurrent seqs |
|---|---|---|
| Multi-user chat | AWQ, 8k ctx, FP8 KV | 12-16 |
| RAG with 32k retrieved context | AWQ, 32k ctx, FP8 KV | 4-6 |
| Long document analysis | AWQ, 128k ctx, FP8 KV | 1 |
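The concurrency ceilings above can be sanity-checked with simple budget arithmetic: VRAM minus weights minus a safety reserve, divided by per-sequence KV cost. A hedged sketch – the 16 GB budget, ~7 GB AWQ weight figure, 1 GB reserve, and the FP8 per-token KV cost (derived from the ~0.5 GB / 8k row of the table above) are this post's approximations, not measured values:

```python
def max_concurrent(vram_gb: float, weights_gb: float,
                   ctx_tokens: int, kv_bytes_per_token: float,
                   reserve_gb: float = 1.0) -> int:
    """How many full-length sequences fit in the remaining KV budget."""
    kv_budget = vram_gb - weights_gb - reserve_gb    # GB left for KV cache
    per_seq = ctx_tokens * kv_bytes_per_token / 1e9  # GB per sequence
    return max(0, int(kv_budget // per_seq))

# Assumed FP8 KV cost per token, read off the table: ~0.5 GB per 8k tokens.
KV_FP8 = 0.5e9 / 8_192  # bytes per token

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx // 1024}k context: ~{max_concurrent(16, 7, ctx, KV_FP8)} seqs")
```

This reproduces the upper ends of the table: ~16 sequences at 8k, ~4 at 32k, and exactly 1 at 128k; real throughput lands lower once scheduling and fragmentation overheads bite.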
For long-context multi-user workloads step up to RTX 5090 32GB.
Long-Context LLM at Mid-Tier
128k context Mistral Nemo on Blackwell 16GB. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: 128k context guide, FP8 KV cache tuning, context budget.