Model Guides

RTX 4090 24GB for Mistral Nemo 12B: 128k Context at FP8 with deep VRAM math

Deep deployment guide for Mistral Nemo 12B on the RTX 4090 24GB - GQA-optimised KV math, full 128k context at FP8, Tekken tokeniser, vLLM recipes and gotchas.

Mistral Nemo 12B is the under-appreciated star of the 12B class. Built jointly by Mistral and NVIDIA in mid-2024, it ships with a native 128k context window using full attention (no sliding window), a refreshed Tekken tokeniser that is roughly 30% more efficient on code and CJK than Llama’s, an Apache 2.0 licence, and grouped-query attention with 8 KV heads. That last detail is what makes it sing on a 24 GB card: long-context decode that would crush Phi-3 Medium runs comfortably on a single RTX 4090 24GB dedicated server. This guide, part of our UK GPU hosting blog, covers the architecture in detail, gives the VRAM math at every realistic context length, lists throughput at single and batched concurrency, walks through deployment, and flags the production gotchas you need to know.


Architecture and licence

Mistral Nemo has 12.2B parameters across 40 layers, hidden dimension 5120, intermediate 14336, with 32 query heads and 8 KV heads (head_dim 128). The 4:1 GQA ratio is identical to Llama 3 8B and is what keeps the KV cache manageable even at 128k context. Apache 2.0 means no compliance friction; there are no usage caps or recipient restrictions in the licence text.
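The headline 12.2B figure can be sanity-checked from these dimensions. A rough parameter-count sketch, assuming Tekken's 131,072-token vocabulary and untied input/output embeddings (both assumptions, not stated above):

```python
# Rough parameter count for Mistral Nemo from its published dimensions.
# Vocab size (131,072) and untied embeddings are assumptions.
layers, hidden, intermediate = 40, 5120, 14336
q_heads, kv_heads, head_dim = 32, 8, 128
vocab = 131_072  # Tekken vocabulary size (assumed)

attn = hidden * (q_heads * head_dim)         # W_q
attn += 2 * hidden * (kv_heads * head_dim)   # W_k, W_v (GQA: only 8 KV heads)
attn += (q_heads * head_dim) * hidden        # W_o
mlp = 3 * hidden * intermediate              # gate, up, down projections
per_layer = attn + mlp

total = layers * per_layer + 2 * vocab * hidden  # plus input + output embeddings
print(f"{total / 1e9:.1f}B parameters")  # → 12.2B parameters
```

Note how little the 8-head GQA costs: W_k and W_v together are a quarter the size of W_q, which is exactly the 4:1 ratio that shrinks the KV cache below.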

The Tekken tokeniser

Tekken is a tiktoken-based BPE tokeniser trained on a more code- and multilingual-heavy corpus than Llama’s. Practical effect: a 1k-token Llama prompt typically tokenises to ~750 tokens in Tekken for English code, and fewer still for Chinese or Japanese. That is a free 25-30% throughput win on identical-looking text. It also means your token-budget math from a Llama-based service does not transfer one-to-one when you migrate.
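When migrating budgets, the conversion is worth making explicit. A minimal sketch using the ~0.75 ratio quoted above for English code (measure your own corpus before relying on it; the function name is illustrative):

```python
# Hypothetical budget recalibration when moving a Llama-tokenised service
# to Tekken. The 0.75 ratio is this article's rough figure for English code.
def tekken_budget(llama_tokens: int, ratio: float = 0.75) -> int:
    """Approximate Tekken token count for text costing `llama_tokens` in Llama."""
    return round(llama_tokens * ratio)

print(tekken_budget(1000))   # → 750, the article's English-code example
print(tekken_budget(8192))   # → 6144: an 8k Llama prompt fits in ~6k Tekken tokens
```

The same ratio cuts both ways: per-token billing calibrated on Llama counts will overcharge after migration (see gotcha 1 below).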

VRAM math at 128k

For Nemo the KV cache cost is 2 * 40 layers * 8 KV heads * 128 head_dim * bytes, which is 81,920 bytes per token at FP8 (80 KB/token), or 160 KB/token at FP16. That is roughly 1/5 of Phi-3 Medium’s per-token cost despite being a similar-sized model.
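The per-token formula and the KV figures used throughout this guide reproduce directly (a minimal sketch; contexts are taken as round thousands of tokens):

```python
# KV-cache cost for Mistral Nemo, from the formula above:
# 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element.
layers, kv_heads, head_dim = 40, 8, 128

def kv_bytes_per_token(bytes_per_elem: int) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

print(kv_bytes_per_token(1) / 1024)   # FP8:  → 80.0 KB/token
print(kv_bytes_per_token(2) / 1024)   # FP16: → 160.0 KB/token

# Whole-cache cost at FP8 for common context lengths (decimal GB)
for ctx in (8_000, 32_000, 128_000):
    gb = kv_bytes_per_token(1) * ctx / 1e9
    print(f"{ctx // 1000}k context: {gb:.1f} GB")  # 0.7, 2.6, 10.5 GB
```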

| Component | FP16 | FP8 W8A8 | AWQ INT4 |
| --- | --- | --- | --- |
| Weights | 24.4 GB | 12.2 GB | 7.0 GB |
| Activations + workspace | 1.0 GB | 1.0 GB | 1.0 GB |
| CUDA / runtime overhead | 0.7 GB | 0.7 GB | 0.7 GB |
| KV @ 8k FP8 | 0.7 GB | 0.7 GB | 0.7 GB |
| KV @ 32k FP8 | 2.6 GB | 2.6 GB | 2.6 GB |
| KV @ 128k FP8 | 10.5 GB | 10.5 GB | 10.5 GB |
| Total @ 128k FP8 | OOM (36.6 GB) | 24.4 GB (very tight) | 19.2 GB |
| Total @ 32k FP8 | OOM | 16.5 GB | 11.3 GB |

FP16 weights at 128k cannot fit on a 24 GB card. FP8 weights with FP8 KV at full 128k is technically possible but leaves zero headroom for batched serving; either lower --gpu-memory-utilization below 0.95, or switch to AWQ INT4 which gives you ~5 GB of free VRAM for batching even at 128k. See the AWQ guide for kernel choice.
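The fit reasoning above can be scripted as a quick headroom check. A sketch using the table's component figures (the numbers are estimates from this article's table, not guarantees; real free VRAM also depends on `--gpu-memory-utilization`):

```python
# Hypothetical VRAM headroom check using the table's component figures (GB).
CARD_GB = 24.0
WEIGHTS = {"fp16": 24.4, "fp8": 12.2, "awq_int4": 7.0}
OVERHEAD = 1.0 + 0.7  # activations/workspace + CUDA runtime

def headroom_gb(precision: str, kv_gb: float) -> float:
    """Free VRAM left for batching after weights, overhead and KV."""
    return CARD_GB - (WEIGHTS[precision] + OVERHEAD + kv_gb)

print(f"FP8 @128k: {headroom_gb('fp8', 10.5):+.1f} GB")       # negative: does not fit with margin
print(f"AWQ @128k: {headroom_gb('awq_int4', 10.5):+.1f} GB")  # ~+4.8 GB free for batching
```

This is why the AWQ deploy below can afford `--max-num-seqs 16` while the FP8 long-context deploy is pinned to 4.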

Throughput and concurrency

Nemo’s decode is bandwidth-bound at small batch and compute-bound past batch 8. The 4090’s 1008 GB/s GDDR6X plus 72 MB L2 keeps both regimes efficient.

| Precision | Batch 1, 4k ctx | Batch 1, 32k ctx | Batch 1, 128k ctx | Batch 4 agg | Batch 8 agg | Batch 16 agg |
| --- | --- | --- | --- | --- | --- | --- |
| FP16 | 72 t/s | OOM | OOM | OOM | OOM | OOM |
| FP8 W8A8 | 145 t/s | 132 t/s | 96 t/s | 410 t/s | 620 t/s | 880 t/s |
| AWQ INT4 | 175 t/s | 156 t/s | 118 t/s | 490 t/s | 740 t/s | 1,020 t/s |

96 t/s at full 128k context is excellent: long-context decode is almost always KV-bandwidth limited, and the 4090’s combination of bandwidth and L2 keeps Nemo efficient even at extreme context. Compare with the Llama 3 8B benchmark for a same-GQA-pattern peer.

vLLM deployment

Two reference launches. The first is the long-context FP8 deploy that we run for retrieval workloads; the second is the higher-throughput AWQ deploy for chat.

pip install "vllm>=0.6.2"

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Nemo-Instruct-2407 \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 131072 --max-num-seqs 4 \
  --gpu-memory-utilization 0.94 \
  --enable-prefix-caching --enable-chunked-prefill

python -m vllm.entrypoints.openai.api_server \
  --model casperhansen/mistral-nemo-instruct-2407-awq \
  --quantization awq_marlin --kv-cache-dtype fp8 \
  --max-model-len 32768 --max-num-seqs 16 \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching

Drop --max-model-len to 32k and you can raise --max-num-seqs to 16 on AWQ for high-throughput chat. See the vLLM setup guide and FP8 deployment for image build details.
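Both launches speak the OpenAI chat-completions protocol. A minimal smoke-test sketch that builds the request body for the FP8 deploy above (this only constructs the JSON; POST it to `http://localhost:8000/v1/chat/completions` with any HTTP client):

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# The model name matches the FP8 launch above; swap in the AWQ repo if needed.
def chat_request(prompt: str, max_tokens: int = 256) -> str:
    body = {
        "model": "mistralai/Mistral-Nemo-Instruct-2407",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.3,  # Mistral recommends low temperature for Nemo
    }
    return json.dumps(body)

print(chat_request("Summarise clause 14.2 of the attached contract."))
```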

Quality benchmarks and scenarios

| Benchmark | Nemo 12B | Llama 3.1 8B | Phi-3 Medium 14B | Qwen 2.5 14B |
| --- | --- | --- | --- | --- |
| MMLU | 68.0 | 69.4 | 78.0 | 80.0 |
| HumanEval | 40.0 | 62.2 | 62.2 | 83.5 |
| MT-Bench | 8.35 | 8.10 | 8.07 | 8.40 |
| RULER @ 128k | strong | moderate | weak | moderate |
| Multilingual MMLU | strong | moderate | moderate | strong |

Scenario A: long-document RAG over 200-page contracts

A legal-tech product retrieves and reasons over UK contract bundles up to 80k tokens. Nemo’s 128k full attention combined with FP8 KV on a single 4090 holds the full bundle in context. Decode at 128k still runs at ~96 t/s — usable for chat-style turn lengths. See the SaaS RAG sizing for batching trade-offs.

Scenario B: multilingual support automation across DE/FR/JP

An EMEA SaaS routes inbound tickets in five languages. Nemo’s multilingual training plus Tekken tokeniser produces ~25% throughput uplift on Japanese vs an equivalent Llama deployment. AWQ at max-num-seqs 16 handles ~50 sustained agents.

Scenario C: tool-calling agent with structured JSON

Nemo was instruction-tuned with reliable tool-token semantics and emits clean JSON for structured output. With vLLM’s guided decoding (outlines or xgrammar backend) it is one of the most reliable open 12B-class models for an agent loop.
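vLLM's guided decoding is driven by a `guided_json` field carrying a JSON Schema in the request body. A sketch for a hypothetical ticket-routing tool call (the schema and field names are illustrative, not from Nemo's tool-token format):

```python
import json

# JSON Schema for a hypothetical routing tool call; vLLM's guided_json
# parameter constrains sampling so the model can only emit matching output.
ROUTE_SCHEMA = {
    "type": "object",
    "properties": {
        "tool": {"type": "string", "enum": ["search_docs", "escalate", "reply"]},
        "arguments": {"type": "object"},
        "confidence": {"type": "number"},
    },
    "required": ["tool", "arguments"],
}

body = {
    "model": "mistralai/Mistral-Nemo-Instruct-2407",
    "messages": [{"role": "user", "content": "Customer reports login loop on SSO."}],
    "max_tokens": 128,
    "guided_json": ROUTE_SCHEMA,  # vLLM-specific extension to the OpenAI API
}
print(json.dumps(body, indent=2))
```

With the schema enforced at decode time, a malformed tool call becomes impossible rather than merely unlikely, which is what makes the agent loop reliable.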

When Nemo wins, when it loses

| Workload | Pick | Why |
| --- | --- | --- |
| RAG over very long documents | Nemo 12B | 128k full attention with GQA |
| Multilingual chat (DE/FR/JP/ZH) | Nemo 12B | Tekken tokeniser, multilingual training |
| Pure English knowledge Q&A | Phi-3 Medium / Qwen 14B | Higher MMLU at similar size |
| Code completion | Qwen 2.5 Coder 14B | HumanEval 88, beats Nemo 2x |
| Highest throughput short prompts | Llama 3.1 8B / Mistral 7B | Higher t/s, lower KV |
| Tool-use agent with JSON | Nemo 12B | Reliable structured output |

Production gotchas

  1. Tekken tokeniser changes your token math. Migration from a Llama service will appear “free” because every prompt costs ~25% fewer tokens, but invoice models built on Llama tokens overestimate cost — recalibrate.
  2. vLLM < 0.6.2 had attention bugs at long context. Pin 0.6.2+ and ideally 0.6.4 for the chunked-prefill stability fixes.
  3. Full 128k FP8 on a single 4090 is tight. gpu-memory-utilization above 0.95 will fail at runtime when the KV pool tries to grow under traffic; either cap context, switch to AWQ, or set 0.93 with smaller max-num-seqs.
  4. Chat template peculiarity. Nemo’s official template differs from Mistral 7B’s — use tokenizer.apply_chat_template and never reuse a Mistral 7B template.
  5. RoPE base vs Llama. Nemo uses RoPE base 1,000,000. Tools that hardcode 10,000 will produce garbage at long context — applies to some custom serving stacks, not vLLM.
  6. Long-context cost at scale. KV grows linearly with context: each 64k-token sequence costs ~5.1 GB of FP8 KV, so just five concurrent tenants at 64k need ~26 GB of KV alone; plan tenancy carefully.
  7. AWQ checkpoint quality. The community AWQ ports of Nemo vary; verify on a small Q&A holdout that the AWQ build hits within 1 point of the FP16 reference on MMLU before shipping.
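These KV-at-scale numbers are worth scripting as a capacity check before committing to a tenancy plan. A minimal sketch using the 80 KB/token FP8 figure from the VRAM section:

```python
# FP8 KV cost per concurrent sequence, from the 80 KB/token figure above.
KV_KB_PER_TOKEN = 80

def kv_gb(context_tokens: int, concurrent_seqs: int = 1) -> float:
    """Aggregate FP8 KV cache in decimal GB for a given tenancy plan."""
    return KV_KB_PER_TOKEN * context_tokens * concurrent_seqs / 1e6  # KB -> GB

print(f"{kv_gb(64_000):.1f} GB per 64k tenant")       # → 5.1 GB per 64k tenant
print(f"{kv_gb(64_000, 5):.1f} GB for 5 concurrent")  # → 25.6 GB for 5 concurrent
```

Anything past a handful of concurrent long-context tenants pushes the KV pool beyond a single 4090, which is the point at which you cap per-tenant context or shard across cards.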

Verdict

For long-context multilingual workloads on a single 4090, Nemo 12B is the strongest open choice at the price. It loses to Phi-3 and Qwen on pure knowledge benchmarks, and to Qwen Coder on code, but it wins decisively on context length, multilingual quality, and tool-calling reliability. Pair it with prefix caching and chunked prefill for the best results.

128k context, single 4090, hosted in the UK

Run Mistral Nemo 12B FP8 at full context, AWQ for chat throughput. UK dedicated hosting.

Order the RTX 4090 24GB

See also: Mistral 7B, Llama 3 8B, Phi-3 Medium, FP8 deployment, vLLM setup, tier positioning, SaaS RAG.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
