Mistral Nemo 12B is the under-appreciated star of the 12B class. Built jointly by Mistral and NVIDIA in mid-2024, it ships with a native 128k context window using full attention (no sliding window), a new Tekken tokeniser that is roughly 30% more efficient on code and CJK text than Llama's, an Apache 2.0 licence, and grouped-query attention with 8 KV heads. That last detail is what makes it sing on a 24 GB card: long-context decode that would crush Phi-3 Medium runs comfortably on a single RTX 4090 24GB dedicated server. This guide covers the architecture in detail, gives the VRAM math at every realistic context length, lists throughput at single and batched concurrency, walks through deployment on our UK GPU hosting, and flags the production gotchas you need to know.
Contents
- Architecture and licence
- VRAM math at 128k
- Throughput and concurrency
- vLLM deployment
- Quality benchmarks and scenarios
- When Nemo wins, when it loses
- Production gotchas
- Verdict
Architecture and licence
Mistral Nemo has 12.2B parameters across 40 layers, hidden dimension 5120, intermediate 14336, with 32 query heads and 8 KV heads (head_dim 128). The 4:1 GQA ratio is identical to Llama 3 8B and is what keeps the KV cache manageable even at 128k context. Apache 2.0 means no compliance friction; there are no usage caps or recipient restrictions in the licence text.
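You can verify these numbers straight from the checkpoint rather than trusting the spec sheet; a minimal sketch, assuming transformers is installed and the model is reachable on the Hugging Face Hub:

```python
from transformers import AutoConfig

# Print the attention geometry quoted above, straight from config.json.
cfg = AutoConfig.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
print(cfg.num_hidden_layers)    # 40 layers
print(cfg.hidden_size)          # 5120
print(cfg.intermediate_size)    # 14336
print(cfg.num_attention_heads)  # 32 query heads
print(cfg.num_key_value_heads)  # 8 KV heads (4:1 GQA)
print(cfg.head_dim)             # 128, set explicitly; note 5120/32 would be 160
```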
The Tekken tokeniser
Tekken is a tiktoken-based BPE tokeniser (replacing SentencePiece) trained on a more code- and multilingual-heavy corpus than Llama's. Practical effect: a 1k-token Llama prompt typically tokenises to ~750 tokens in Tekken for English code, and even fewer for Chinese or Japanese. That is a free 25-30% throughput win on identical-looking text. It also means your token-budget math from a Llama-based service does not transfer one-to-one when you migrate.
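To measure the effect on your own traffic, count tokens under both tokenisers; a minimal sketch, assuming Hub access to both checkpoints (the Llama repo is gated) and using throwaway sample strings in place of your real prompts:

```python
from transformers import AutoTokenizer

# Compare token counts for identical text under both tokenisers.
llama = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
nemo = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")

samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "python": "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
    "japanese": "お問い合わせありがとうございます。担当者より折り返しご連絡いたします。",
}

for name, text in samples.items():
    n_llama = len(llama.encode(text, add_special_tokens=False))
    n_nemo = len(nemo.encode(text, add_special_tokens=False))
    print(f"{name:9s} llama={n_llama:4d} nemo={n_nemo:4d} ratio={n_nemo / n_llama:.2f}")
```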
VRAM math at 128k
For Nemo the per-token KV cache cost is 2 (K and V) * 40 layers * 8 KV heads * 128 head_dim * bytes per element, which is 81,920 bytes per token at FP8 (80 KB/token), or 160 KB/token at FP16. That is lower than Phi-3 Medium's per-token cost at the same context, despite the similar model size.
| Component | FP16 | FP8 W8A8 | AWQ INT4 |
|---|---|---|---|
| Weights | 24.4 GB | 12.2 GB | 7.0 GB |
| Activations + workspace | 1.0 GB | 1.0 GB | 1.0 GB |
| CUDA / runtime overhead | 0.7 GB | 0.7 GB | 0.7 GB |
| KV @ 8k FP8 | 0.7 GB | 0.7 GB | 0.7 GB |
| KV @ 32k FP8 | 2.6 GB | 2.6 GB | 2.6 GB |
| KV @ 128k FP8 | 10.5 GB | 10.5 GB | 10.5 GB |
| Total @ 128k FP8 | OOM (36.6 GB) | 24.4 GB (very tight) | 19.2 GB |
| Total @ 32k FP8 | OOM (28.7 GB) | 16.5 GB | 11.3 GB |
FP16 weights at 128k cannot fit on a 24 GB card. FP8 weights with FP8 KV at full 128k is technically possible but leaves zero headroom for batched serving; either lower --gpu-memory-utilization below 0.95, or switch to AWQ INT4 which gives you ~5 GB of free VRAM for batching even at 128k. See the AWQ guide for kernel choice.
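The table above is reproducible from the per-token formula; a sketch in plain Python, using decimal GB and the weight/overhead estimates from the table (the fit thresholds are rough labels, not allocator guarantees):

```python
# Reproduce the VRAM table: per-token KV cost from the GQA geometry,
# plus the weight and overhead estimates quoted above.
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128
KV_BYTES_PER_TOKEN = 2 * LAYERS * KV_HEADS * HEAD_DIM  # K and V at FP8 -> 81,920 B

WEIGHTS_GB = {"fp16": 24.4, "fp8": 12.2, "awq_int4": 7.0}  # estimates from the table
OVERHEAD_GB = 1.0 + 0.7  # activations/workspace + CUDA runtime

for precision, weights in WEIGHTS_GB.items():
    for ctx in (8_000, 32_000, 128_000):
        kv = KV_BYTES_PER_TOKEN * ctx / 1e9  # FP8 KV cache, decimal GB
        total = weights + OVERHEAD_GB + kv
        verdict = "OOM" if total > 24.5 else ("very tight" if total > 23.0 else "fits")
        print(f"{precision:8s} ctx={ctx:>7,} kv={kv:5.1f} GB total={total:5.1f} GB {verdict}")
```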
Throughput and concurrency
Nemo’s decode is bandwidth-bound at small batch and compute-bound past batch 8. The 4090’s 1008 GB/s of GDDR6X bandwidth plus its 72 MB L2 cache keep both regimes efficient.
| Precision | Batch 1, 4k ctx | Batch 1, 32k ctx | Batch 1, 128k ctx | Batch 4 agg | Batch 8 agg | Batch 16 agg |
|---|---|---|---|---|---|---|
| FP16 | 72 t/s | OOM | OOM | OOM | OOM | OOM |
| FP8 W8A8 | 145 t/s | 132 t/s | 96 t/s | 410 t/s | 620 t/s | 880 t/s |
| AWQ INT4 | 175 t/s | 156 t/s | 118 t/s | 490 t/s | 740 t/s | 1,020 t/s |
96 t/s at full 128k context is excellent: long-context decode is almost always KV-bandwidth limited, and the 4090’s combination of bandwidth and L2 keeps Nemo efficient even at extreme context. Compare with the Llama 3 8B benchmark for a same-GQA-pattern peer.
vLLM deployment
Two reference launches. The first is the long-context FP8 deploy that we run for retrieval workloads; the second is the higher-throughput AWQ deploy for chat.
```bash
pip install "vllm>=0.6.2"

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Nemo-Instruct-2407 \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 131072 --max-num-seqs 4 \
  --gpu-memory-utilization 0.94 \
  --enable-prefix-caching --enable-chunked-prefill
```

```bash
python -m vllm.entrypoints.openai.api_server \
  --model casperhansen/mistral-nemo-instruct-2407-awq \
  --quantization awq_marlin --kv-cache-dtype fp8 \
  --max-model-len 32768 --max-num-seqs 16 \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching
```
Drop --max-model-len to 32k and you can raise --max-num-seqs to 16 on AWQ for high-throughput chat. See the vLLM setup guide and FP8 deployment for image build details.
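Once either server is up it speaks the OpenAI-compatible API (port 8000 by default); a minimal smoke test with the openai Python client, where base_url and api_key are placeholders for a local deploy:

```python
from openai import OpenAI

# vLLM ignores the API key by default; the model name must match --model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="mistralai/Mistral-Nemo-Instruct-2407",
    messages=[{"role": "user", "content": "Summarise GQA in two sentences."}],
    max_tokens=128,
    temperature=0.3,
)
print(resp.choices[0].message.content)
```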
Quality benchmarks and scenarios
| Benchmark | Nemo 12B | Llama 3.1 8B | Phi-3 Medium 14B | Qwen 2.5 14B |
|---|---|---|---|---|
| MMLU | 68.0 | 69.4 | 78.0 | 80.0 |
| HumanEval | 40.0 | 62.2 | 62.2 | 83.5 |
| MT-Bench | 8.35 | 8.10 | 8.07 | 8.40 |
| RULER @ 128k | strong | moderate | weak | moderate |
| Multilingual MMLU | strong | moderate | moderate | strong |
Scenario A: long-document RAG over 200-page contracts
A legal-tech product retrieves and reasons over UK contract bundles up to 80k tokens. Nemo’s 128k full attention combined with FP8 KV on a single 4090 holds the full bundle in context. Decode at 128k still runs at ~96 t/s — usable for chat-style turn lengths. See the SaaS RAG sizing for batching trade-offs.
Scenario B: multilingual support automation across DE/FR/JP
An EMEA SaaS routes inbound tickets in five languages. Nemo’s multilingual training plus the Tekken tokeniser produces a ~25% throughput uplift on Japanese versus an equivalent Llama deployment. AWQ at --max-num-seqs 16 handles ~50 sustained agents.
Scenario C: tool-calling agent with structured JSON
Nemo was instruction-tuned with reliable tool-token semantics and emits clean JSON for structured output. With vLLM’s guided decoding (outlines or xgrammar backend) it is one of the most reliable open 12B-class models for an agent loop.
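As a sketch of that agent loop against the FP8 server above, using vLLM's guided_json extra parameter (available in recent vLLM versions); the ticket-routing schema is an invented example, not part of Nemo's native tool format:

```python
from openai import OpenAI

# Invented example schema: constrain output to a ticket-routing decision.
schema = {
    "type": "object",
    "properties": {
        "intent": {"type": "string", "enum": ["billing", "bug", "feature"]},
        "priority": {"type": "integer", "minimum": 1, "maximum": 3},
    },
    "required": ["intent", "priority"],
}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="mistralai/Mistral-Nemo-Instruct-2407",
    messages=[{"role": "user", "content": "Ticket: the invoice PDF 404s."}],
    extra_body={"guided_json": schema},  # vLLM-specific extra parameter
)
print(resp.choices[0].message.content)  # JSON conforming to the schema
```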
When Nemo wins, when it loses
| Workload | Pick | Why |
|---|---|---|
| RAG over very long documents | Nemo 12B | 128k full attention with GQA |
| Multilingual chat (DE/FR/JP/ZH) | Nemo 12B | Tekken tokeniser, multilingual training |
| Pure English knowledge Q&A | Phi-3 Medium / Qwen 14B | Higher MMLU at similar size |
| Code completion | Qwen 2.5 Coder 14B | HumanEval ~88, more than double Nemo’s 40 |
| Highest throughput short prompts | Llama 3.1 8B / Mistral 7B | Higher t/s, lower KV |
| Tool-use agent with JSON | Nemo 12B | Reliable structured output |
Production gotchas
- Tekken tokeniser changes your token math. Migrating from a Llama-based service will look “free” because every prompt costs ~25% fewer tokens, but billing models calibrated on Llama token counts will overestimate cost; recalibrate them.
- vLLM < 0.6.2 had attention bugs at long context. Pin 0.6.2+ and ideally 0.6.4 for the chunked-prefill stability fixes.
- Full 128k FP8 on a single 4090 is tight. --gpu-memory-utilization above 0.95 will fail at runtime when the KV pool tries to grow under traffic; either cap context, switch to AWQ, or set 0.93 with a smaller --max-num-seqs.
- Chat template peculiarity. Nemo’s official template differs from Mistral 7B’s: use tokenizer.apply_chat_template and never reuse a Mistral 7B template (see the sketch after this list).
- RoPE base vs Llama. Nemo uses RoPE base 1,000,000. Tools that hardcode 10,000 will produce garbage at long context; this applies to some custom serving stacks, not vLLM.
- Long-context cost at scale. KV grows linearly with both context and concurrency: at FP8, a single 64k-token sequence costs ~5.2 GB of KV, so five concurrent 64k sequences already need ~26 GB, more than the card holds. Plan tenancy and per-tenant context caps carefully.
- AWQ checkpoint quality. The community AWQ ports of Nemo vary; verify on a small Q&A holdout that the AWQ build hits within 1 point of the FP16 reference on MMLU before shipping.
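The chat-template sketch referenced in the list above, assuming Hub access to the instruct checkpoint:

```python
from transformers import AutoTokenizer

# Build the prompt with Nemo's own template; never hand-roll a
# Mistral 7B-style [INST] string for this model.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
messages = [{"role": "user", "content": "List three risks in this clause."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```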
Verdict
For long-context multilingual workloads on a single 4090, Nemo 12B is the strongest open choice at the price. It loses to Phi-3 and Qwen on pure knowledge benchmarks, and to Qwen Coder on code, but it wins decisively on context length, multilingual quality, and tool-calling reliability. Pair it with prefix caching and chunked prefill for the best results.
128k context, single 4090, hosted in the UK
Run Mistral Nemo 12B FP8 at full context, AWQ for chat throughput. UK dedicated hosting.
Order the RTX 4090 24GB

See also: Mistral 7B, Llama 3 8B, Phi-3 Medium, FP8 deployment, vLLM setup, tier positioning, SaaS RAG.