Mistral Nemo 12B has two claims to attention: it supports 128k context out of the box, and it fits a single 16-24 GB GPU at INT8. On our dedicated GPU hosting it is the default pick for workloads that need long-context LLM processing without flagship hardware.
VRAM
Context length directly affects VRAM for KV cache. At 128k the cache cost becomes dominant:
| Context | KV per sequence (FP16) |
|---|---|
| 8k | ~1 GB |
| 32k | ~4 GB |
| 128k | ~16 GB |
Weights at INT4 are ~7 GB. Running 128k context therefore needs ~23 GB for a single sequence (~7 GB weights + ~16 GB KV cache) – a single 24 GB 3090 can host one long-context request but not multiple.
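The table's scaling can be reproduced from the model's dimensions. A minimal sketch, assuming Mistral Nemo's published config (40 layers, 8 KV heads via GQA, head dim 128 – treat these as assumptions; the table's figures are rounded, so exact estimates land slightly differently, but the linear growth with context is the same either way):

```python
def kv_cache_bytes(context_len, n_layers=40, n_kv_heads=8,
                   head_dim=128, dtype_bytes=2):
    """Per-sequence KV cache: 2 tensors (K and V) per layer,
    each context_len x n_kv_heads x head_dim elements."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * context_len

GiB = 1024 ** 3
for ctx in (8 * 1024, 32 * 1024, 128 * 1024):
    print(f"{ctx // 1024}k context: {kv_cache_bytes(ctx) / GiB:.1f} GiB")
```

Note that `dtype_bytes=2` is FP16; passing `dtype_bytes=1` models an fp8 cache.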
GPU Options
- 3090 24 GB: 128k single-user is possible; multi-user needs smaller context
- 5090 32 GB: comfortable multi-user at 64k, single-user at 128k
- 6000 Pro 96 GB: multi-user at 128k with headroom
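The sizing above follows from simple arithmetic: subtract the weight footprint from VRAM, then divide by the per-sequence KV cost. A rough sketch using the article's round numbers (7 GB INT4 weights, 16 GB FP16 KV at 128k – approximations that ignore activation and framework overhead):

```python
def max_concurrent_seqs(vram_gb, weights_gb=7, kv_per_seq_gb=16):
    """How many full-context sequences fit after loading weights."""
    return max(0, int((vram_gb - weights_gb) // kv_per_seq_gb))

# At 128k context with an FP16 KV cache:
for name, vram in [("3090", 24), ("5090", 32), ("6000 Pro", 96)]:
    print(f"{name}: {max_concurrent_seqs(vram)} sequence(s)")
```

This matches the list: the 3090 and 5090 hold one full-context sequence each, while the 6000 Pro holds several with headroom.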
Deployment
```bash
# Note: --quantization awq assumes an AWQ-quantized checkpoint of the model,
# not the original BF16 weights.
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Nemo-Instruct-2407 \
  --quantization awq \
  --max-model-len 65536 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92
```
`--kv-cache-dtype fp8` halves KV cache size with minor quality impact – essential for running long context on smaller cards.
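The effect of the fp8 cache on how much context fits in a fixed budget can be sketched as follows (the per-token figure is derived from the same assumed model dimensions as above: 2 tensors × 40 layers × 8 KV heads × 128 dims × 2 bytes):

```python
def max_context_tokens(budget_gib, kv_bytes_per_token):
    """Largest context that fits in a given KV-cache budget."""
    GiB = 1024 ** 3
    return int(budget_gib * GiB // kv_bytes_per_token)

fp16_per_token = 163_840             # assumption: FP16 KV cost per token
fp8_per_token = fp16_per_token // 2  # fp8 halves the per-token cost

budget_gib = 16  # illustrative: VRAM left for KV after weights
print(max_context_tokens(budget_gib, fp16_per_token))
print(max_context_tokens(budget_gib, fp8_per_token))  # roughly double
```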
Long Context
Long context costs money (more VRAM, slower prefill). Use it when you actually need to reference distant information: full-document Q&A, long chat sessions, multi-turn agents with tool outputs. Do not pad context with retrieved chunks when a shorter window plus better retrieval would be faster.
Long-Context LLM Hosting
Mistral Nemo 12B at 128k context, preconfigured on UK dedicated GPUs.
Browse GPU Servers

For shorter-context workloads see Mistral Small 3 24B. For higher-quality long context see Qwen 2.5 14B.