Model Guides

RTX 4090 24GB for Mistral Nemo 12B: 128k Context at FP8 with deep VRAM math

Deep deployment guide for Mistral Nemo 12B on the RTX 4090 24GB - GQA-optimised KV math, full 128k context at FP8, Tekken tokeniser, vLLM recipes and gotchas.

Mistral Nemo 12B is the under-appreciated star of the 12B class. Built jointly by Mistral and NVIDIA in mid-2024, it ships with a native 128k context window using full attention (no sliding window), a refreshed Tekken tokeniser that is roughly 30% more efficient on code and CJK than Llama’s, an Apache 2.0 licence, and grouped-query attention with 8 KV heads. That last detail is what makes it sing on a 24 GB card: long-context decode that would crush Phi-3 Medium runs comfortably on a single RTX 4090 24GB dedicated server. This guide, part of our UK GPU hosting blog, covers the architecture in detail, gives the VRAM math at every realistic context length, lists throughput at single and batched concurrency, walks through deployment, and flags the production gotchas you need to know.


Architecture and licence

Mistral Nemo has 12.2B parameters across 40 layers, hidden dimension 5120, intermediate 14336, with 32 query heads and 8 KV heads (head_dim 128). The 4:1 GQA ratio is identical to Llama 3 8B and is what keeps the KV cache manageable even at 128k context. Apache 2.0 means no compliance friction; there are no usage caps or recipient restrictions in the licence text.
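The headline 12.2B figure can be sanity-checked from these dimensions. A rough parameter-count sketch, assuming Tekken's 131,072-token vocabulary and untied input/output embeddings (both assumptions, not stated above):

```python
# Rough parameter count for Mistral Nemo from its published dimensions.
# Vocab size (131,072) and untied embeddings are assumptions.
layers, hidden, intermediate = 40, 5120, 14336
q_heads, kv_heads, head_dim = 32, 8, 128
vocab = 131_072  # Tekken vocabulary size (assumed)

attn = hidden * (q_heads * head_dim)         # W_q
attn += 2 * hidden * (kv_heads * head_dim)   # W_k, W_v (GQA: only 8 KV heads)
attn += (q_heads * head_dim) * hidden        # W_o
mlp = 3 * hidden * intermediate              # gate, up, down projections
per_layer = attn + mlp

total = layers * per_layer + 2 * vocab * hidden  # plus input + output embeddings
print(f"{total / 1e9:.1f}B parameters")  # → 12.2B parameters
```

Note how little the 8-head GQA costs: W_k and W_v together are a quarter the size of W_q, which is exactly the 4:1 ratio that shrinks the KV cache below.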

The Tekken tokeniser

Tekken is a tiktoken-based BPE tokeniser trained on a more code- and multilingual-heavy corpus than Llama’s. Practical effect: a 1k-token Llama prompt typically tokenises to ~750 tokens in Tekken for English code, and fewer still for Chinese or Japanese. That is a free 25-30% throughput win on identical-looking text. It also means your token-budget math from a Llama-based service does not transfer one-to-one when you migrate.
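When migrating budgets, the conversion is worth making explicit. A minimal sketch using the ~0.75 ratio quoted above for English code (measure your own corpus before relying on it; the function name is illustrative):

```python
# Hypothetical budget recalibration when moving a Llama-tokenised service
# to Tekken. The 0.75 ratio is this article's rough figure for English code.
def tekken_budget(llama_tokens: int, ratio: float = 0.75) -> int:
    """Approximate Tekken token count for text costing `llama_tokens` in Llama."""
    return round(llama_tokens * ratio)

print(tekken_budget(1000))   # → 750, the article's English-code example
print(tekken_budget(8192))   # → 6144: an 8k Llama prompt fits in ~6k Tekken tokens
```

The same ratio cuts both ways: per-token billing calibrated on Llama counts will overcharge after migration (see gotcha 1 below).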

VRAM math at 128k

For Nemo the KV cache cost is 2 * 40 layers * 8 KV heads * 128 head_dim * bytes, which is 81,920 bytes per token at FP8 (80 KB/token), or 160 KB/token at FP16. That is roughly 1/5 of Phi-3 Medium’s per-token cost despite being a similar-sized model.
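The per-token formula and the KV figures used throughout this guide reproduce directly (a minimal sketch; contexts are taken as round thousands of tokens):

```python
# KV-cache cost for Mistral Nemo, from the formula above:
# 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element.
layers, kv_heads, head_dim = 40, 8, 128

def kv_bytes_per_token(bytes_per_elem: int) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

print(kv_bytes_per_token(1) / 1024)   # FP8:  → 80.0 KB/token
print(kv_bytes_per_token(2) / 1024)   # FP16: → 160.0 KB/token

# Whole-cache cost at FP8 for common context lengths (decimal GB)
for ctx in (8_000, 32_000, 128_000):
    gb = kv_bytes_per_token(1) * ctx / 1e9
    print(f"{ctx // 1000}k context: {gb:.1f} GB")  # 0.7, 2.6, 10.5 GB
```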

| Component | FP16 | FP8 W8A8 | AWQ INT4 |
| --- | --- | --- | --- |
| Weights | 24.4 GB | 12.2 GB | 7.0 GB |
| Activations + workspace | 1.0 GB | 1.0 GB | 1.0 GB |
| CUDA / runtime overhead | 0.7 GB | 0.7 GB | 0.7 GB |
| KV @ 8k FP8 | 0.7 GB | 0.7 GB | 0.7 GB |
| KV @ 32k FP8 | 2.6 GB | 2.6 GB | 2.6 GB |
| KV @ 128k FP8 | 10.5 GB | 10.5 GB | 10.5 GB |
| Total @ 128k FP8 | OOM (36.6 GB) | 24.4 GB (very tight) | 19.2 GB |
| Total @ 32k FP8 | OOM | 16.5 GB | 11.3 GB |

FP16 weights at 128k cannot fit on a 24 GB card. FP8 weights with FP8 KV at full 128k is technically possible but leaves zero headroom for batched serving; either lower --gpu-memory-utilization below 0.95, or switch to AWQ INT4 which gives you ~5 GB of free VRAM for batching even at 128k. See the AWQ guide for kernel choice.
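The fit reasoning above can be scripted as a quick headroom check. A sketch using the table's component figures (the numbers are estimates from this article's table, not guarantees; real free VRAM also depends on `--gpu-memory-utilization`):

```python
# Hypothetical VRAM headroom check using the table's component figures (GB).
CARD_GB = 24.0
WEIGHTS = {"fp16": 24.4, "fp8": 12.2, "awq_int4": 7.0}
OVERHEAD = 1.0 + 0.7  # activations/workspace + CUDA runtime

def headroom_gb(precision: str, kv_gb: float) -> float:
    """Free VRAM left for batching after weights, overhead and KV."""
    return CARD_GB - (WEIGHTS[precision] + OVERHEAD + kv_gb)

print(f"FP8 @128k: {headroom_gb('fp8', 10.5):+.1f} GB")       # negative: does not fit with margin
print(f"AWQ @128k: {headroom_gb('awq_int4', 10.5):+.1f} GB")  # ~+4.8 GB free for batching
```

This is why the AWQ deploy below can afford `--max-num-seqs 16` while the FP8 long-context deploy is pinned to 4.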

Throughput and concurrency

Nemo’s decode is bandwidth-bound at small batch and compute-bound past batch 8. The 4090’s 1008 GB/s GDDR6X plus 72 MB L2 keeps both regimes efficient.

| Precision | Batch 1, 4k ctx | Batch 1, 32k ctx | Batch 1, 128k ctx | Batch 4 agg | Batch 8 agg | Batch 16 agg |
| --- | --- | --- | --- | --- | --- | --- |
| FP16 | 72 t/s | OOM | OOM | OOM | OOM | OOM |
| FP8 W8A8 | 145 t/s | 132 t/s | 96 t/s | 410 t/s | 620 t/s | 880 t/s |
| AWQ INT4 | 175 t/s | 156 t/s | 118 t/s | 490 t/s | 740 t/s | 1,020 t/s |

96 t/s at full 128k context is excellent: long-context decode is almost always KV-bandwidth limited, and the 4090’s combination of bandwidth and L2 keeps Nemo efficient even at extreme context. Compare with the Llama 3 8B benchmark for a same-GQA-pattern peer.

vLLM deployment

Two reference launches. The first is the long-context FP8 deploy that we run for retrieval workloads; the second is the higher-throughput AWQ deploy for chat.

pip install "vllm>=0.6.2"

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Nemo-Instruct-2407 \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 131072 --max-num-seqs 4 \
  --gpu-memory-utilization 0.94 \
  --enable-prefix-caching --enable-chunked-prefill

python -m vllm.entrypoints.openai.api_server \
  --model casperhansen/mistral-nemo-instruct-2407-awq \
  --quantization awq_marlin --kv-cache-dtype fp8 \
  --max-model-len 32768 --max-num-seqs 16 \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching

Drop --max-model-len to 32k and you can raise --max-num-seqs to 16 on AWQ for high-throughput chat. See the vLLM setup guide and FP8 deployment for image build details.
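Both launches speak the OpenAI chat-completions protocol. A minimal smoke-test sketch that builds the request body for the FP8 deploy above (this only constructs the JSON; POST it to `http://localhost:8000/v1/chat/completions` with any HTTP client):

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# The model name matches the FP8 launch above; swap in the AWQ repo if needed.
def chat_request(prompt: str, max_tokens: int = 256) -> str:
    body = {
        "model": "mistralai/Mistral-Nemo-Instruct-2407",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.3,  # Mistral recommends low temperature for Nemo
    }
    return json.dumps(body)

print(chat_request("Summarise clause 14.2 of the attached contract."))
```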

Quality benchmarks and scenarios

| Benchmark | Nemo 12B | Llama 3.1 8B | Phi-3 Medium 14B | Qwen 2.5 14B |
| --- | --- | --- | --- | --- |
| MMLU | 68.0 | 69.4 | 78.0 | 80.0 |
| HumanEval | 40.0 | 62.2 | 62.2 | 83.5 |
| MT-Bench | 8.35 | 8.10 | 8.07 | 8.40 |
| RULER @ 128k | strong | moderate | weak | moderate |
| Multilingual MMLU | strong | moderate | moderate | strong |

Scenario A: long-document RAG over 200-page contracts

A legal-tech product retrieves and reasons over UK contract bundles up to 80k tokens. Nemo’s 128k full attention combined with FP8 KV on a single 4090 holds the full bundle in context. Decode at 128k still runs at ~96 t/s — usable for chat-style turn lengths. See the SaaS RAG sizing for batching trade-offs.

Scenario B: multilingual support automation across DE/FR/JP

An EMEA SaaS routes inbound tickets in five languages. Nemo’s multilingual training plus Tekken tokeniser produces ~25% throughput uplift on Japanese vs an equivalent Llama deployment. AWQ at max-num-seqs 16 handles ~50 sustained agents.

Scenario C: tool-calling agent with structured JSON

Nemo was instruction-tuned with reliable tool-token semantics and emits clean JSON for structured output. With vLLM’s guided decoding (outlines or xgrammar backend) it is one of the most reliable open 12B-class models for an agent loop.
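vLLM's guided decoding is driven by a `guided_json` field carrying a JSON Schema in the request body. A sketch for a hypothetical ticket-routing tool call (the schema and field names are illustrative, not from Nemo's tool-token format):

```python
import json

# JSON Schema for a hypothetical routing tool call; vLLM's guided_json
# parameter constrains sampling so the model can only emit matching output.
ROUTE_SCHEMA = {
    "type": "object",
    "properties": {
        "tool": {"type": "string", "enum": ["search_docs", "escalate", "reply"]},
        "arguments": {"type": "object"},
        "confidence": {"type": "number"},
    },
    "required": ["tool", "arguments"],
}

body = {
    "model": "mistralai/Mistral-Nemo-Instruct-2407",
    "messages": [{"role": "user", "content": "Customer reports login loop on SSO."}],
    "max_tokens": 128,
    "guided_json": ROUTE_SCHEMA,  # vLLM-specific extension to the OpenAI API
}
print(json.dumps(body, indent=2))
```

With the schema enforced at decode time, a malformed tool call becomes impossible rather than merely unlikely, which is what makes the agent loop reliable.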

When Nemo wins, when it loses

| Workload | Pick | Why |
| --- | --- | --- |
| RAG over very long documents | Nemo 12B | 128k full attention with GQA |
| Multilingual chat (DE/FR/JP/ZH) | Nemo 12B | Tekken tokeniser, multilingual training |
| Pure English knowledge Q&A | Phi-3 Medium / Qwen 14B | Higher MMLU at similar size |
| Code completion | Qwen 2.5 Coder 14B | HumanEval 88, beats Nemo 2x |
| Highest throughput short prompts | Llama 3.1 8B / Mistral 7B | Higher t/s, lower KV |
| Tool-use agent with JSON | Nemo 12B | Reliable structured output |

Production gotchas

  1. Tekken tokeniser changes your token math. Migration from a Llama service will appear “free” because every prompt costs ~25% fewer tokens, but invoice models built on Llama tokens overestimate cost — recalibrate.
  2. vLLM < 0.6.2 had attention bugs at long context. Pin 0.6.2+ and ideally 0.6.4 for the chunked-prefill stability fixes.
  3. Full 128k FP8 on a single 4090 is tight. gpu-memory-utilization above 0.95 will fail at runtime when the KV pool tries to grow under traffic; either cap context, switch to AWQ, or set 0.93 with smaller max-num-seqs.
  4. Chat template peculiarity. Nemo’s official template differs from Mistral 7B’s — use tokenizer.apply_chat_template and never reuse a Mistral 7B template.
  5. RoPE base vs Llama. Nemo uses RoPE base 1,000,000. Tools that hardcode 10,000 will produce garbage at long context — applies to some custom serving stacks, not vLLM.
  6. Long-context cost at scale. KV grows linearly with context: each 64k-token sequence costs ~5.1 GB of FP8 KV, so just five concurrent tenants at 64k need ~26 GB of KV alone; plan tenancy carefully.
  7. AWQ checkpoint quality. The community AWQ ports of Nemo vary; verify on a small Q&A holdout that the AWQ build hits within 1 point of the FP16 reference on MMLU before shipping.
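These KV-at-scale numbers are worth scripting as a capacity check before committing to a tenancy plan. A minimal sketch using the 80 KB/token FP8 figure from the VRAM section:

```python
# FP8 KV cost per concurrent sequence, from the 80 KB/token figure above.
KV_KB_PER_TOKEN = 80

def kv_gb(context_tokens: int, concurrent_seqs: int = 1) -> float:
    """Aggregate FP8 KV cache in decimal GB for a given tenancy plan."""
    return KV_KB_PER_TOKEN * context_tokens * concurrent_seqs / 1e6  # KB -> GB

print(f"{kv_gb(64_000):.1f} GB per 64k tenant")       # → 5.1 GB per 64k tenant
print(f"{kv_gb(64_000, 5):.1f} GB for 5 concurrent")  # → 25.6 GB for 5 concurrent
```

Anything past a handful of concurrent long-context tenants pushes the KV pool beyond a single 4090, which is the point at which you cap per-tenant context or shard across cards.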

Verdict

For long-context multilingual workloads on a single 4090, Nemo 12B is the strongest open choice at the price. It loses to Phi-3 and Qwen on pure knowledge benchmarks, and to Qwen Coder on code, but it wins decisively on context length, multilingual quality, and tool-calling reliability. Pair it with prefix caching and chunked prefill for the best results.

128k context, single 4090, hosted in the UK

Run Mistral Nemo 12B FP8 at full context, AWQ for chat throughput. UK dedicated hosting.

Order the RTX 4090 24GB

See also: Mistral 7B, Llama 3 8B, Phi-3 Medium, FP8 deployment, vLLM setup, tier positioning, SaaS RAG.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
