
RTX 4090 24GB for Mistral 7B v0.3: Throughput and Sizing

Production deployment guide for Mistral 7B v0.3 on the RTX 4090 24GB: VRAM math, throughput by batch and quant, sliding window attention math, latency profile, vLLM/TGI/Ollama deployment, real workloads (12-engineer team, 200-MAU SaaS), Apache 2.0 licence and how it actually compares to Llama 3.1 8B.

Mistral 7B v0.3 is the leanest, fastest open model still in active production use, and on the RTX 4090 24GB it is the throughput champion of its class: 215 tokens per second per stream at FP8, a comfortable 32k context, an Apache 2.0 licence and a sliding window attention scheme that keeps KV memory bounded regardless of input length. On UK dedicated GPU hosting it is the right pick when raw speed and a permissive licence matter more than the extra benchmark points Llama 3.1 8B brings. This guide walks through the VRAM accounting, the sliding window math that makes long context cheap, throughput by batch and quant, the latency profile, deployment options across vLLM, TGI, Ollama and llama.cpp, real workload sizings, and the head-to-head with Llama 3.1 8B.

Why Mistral 7B v0.3 is still relevant

Mistral 7B v0.3 (Mistral AI, May 2024 update) has 7.25B parameters, 32 layers, 8 KV heads (grouped-query attention), and a 32k effective context with sliding window attention over a 4k window. The v0.3 update extended the tokenizer vocabulary to 32,768 tokens (up from 32,000), reserving control tokens for tool use, and added native function-calling tokens. The Apache 2.0 licence is the headline distinction from Llama 3.1: there are no acceptable-use clauses and no MAU cap. Combined, those properties make Mistral 7B the canonical “fast and free” model.

For tooling, RAG, customer support and “small fast generator behind a product surface” use cases it is still the model to beat in its tier. It loses to Llama 3.1 8B on most benchmarks (60.8 MMLU vs 68.4, 30.5 HumanEval vs 62.2) but wins on raw decode throughput (215 t/s vs 198 t/s at FP8), on KV cost at long context (the sliding window caps it), on licence permissiveness, and on the breadth of community fine-tunes available for it. For a workload where the answer quality of a ~61-MMLU model is acceptable and throughput matters, Mistral 7B is the right call.

VRAM accounting and the sliding window math

Mistral 7B v0.3 has 7.25B parameters. The per-token KV cost for full attention would be 2 (K and V) x 8 (KV heads) x 128 (head dim) x 32 (layers) x 2 bytes (FP16) = 131 KB per token, or 65.5 KB with FP8 KV. With sliding window attention at a 4k window, stored KV is capped at 4k tokens per layer regardless of input length: a 32k input still only keeps the most recent 4k tokens of KV. Full-attention KV at 32k would be 4.3 GB at FP16 (2.1 GB at FP8); the sliding window keeps it around 0.5 GB.
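
As a sanity check, here is the same arithmetic in a few lines of Python (dimensions from the paragraph above; the helper names are illustrative):

# KV-cache arithmetic for Mistral 7B v0.3
KV_HEADS, HEAD_DIM, LAYERS = 8, 128, 32

def kv_bytes_per_token(dtype_bytes: int) -> int:
    # K and V, per KV head, per layer
    return 2 * KV_HEADS * HEAD_DIM * LAYERS * dtype_bytes

def kv_cache_gb(context_len: int, dtype_bytes: int, window: int | None = None) -> float:
    # Sliding window caps the stored KV at `window` tokens per layer
    stored = min(context_len, window) if window else context_len
    return stored * kv_bytes_per_token(dtype_bytes) / 1e9

print(kv_bytes_per_token(2))                 # 131072 bytes ~= 131 KB/token at FP16
print(kv_cache_gb(32_768, 2))                # ~4.3 GB, full attention, FP16 KV
print(kv_cache_gb(32_768, 2, window=4_096))  # ~0.5 GB with the 4k sliding window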

| Component | FP16 | FP8 | AWQ INT4 |
| --- | --- | --- | --- |
| Weights | 14.5 GB | 7.25 GB | 4.1 GB |
| Activations + workspace | 0.7 GB | 0.7 GB | 0.7 GB |
| CUDA + driver overhead | 0.6 GB | 0.6 GB | 0.6 GB |
| KV per token (FP8 KV, full attn) | 65.5 KB | 65.5 KB | 65.5 KB |
| KV @ 32k (sliding 4k window) | 0.5 GB | 0.5 GB | 0.5 GB |
| KV @ 32k (FP8 KV, full attention) | 2.1 GB | 2.1 GB | 2.1 GB |
| Total @ 32k sliding, 1 stream | 16.3 GB | 9.0 GB | 6.0 GB |
| Total @ 32k sliding, 32 streams (FP8 KV) | doesn’t fit | 24.5 GB (OOM) | 21.7 GB |
| Total @ 32k sliding, 16 streams (FP8 KV) | doesn’t fit | 16.5 GB | 13.4 GB |

The sliding window attention caps KV growth: a 4k window means you never store more than 4k tokens of KV per layer regardless of input length. That is why 32k context is so cheap on Mistral compared to a vanilla full-attention 8B model. It also has a quality cost: the model can only attend to the most recent 4k tokens at each layer, so genuinely long-range dependencies lose fidelity. For RAG with chunked retrieval where each retrieval is small, this is invisible; for tasks that genuinely need to integrate signal across 32k tokens, Llama 3.1 8B with full attention is the better pick.

Throughput by batch and quant

Measured on a 4090 24GB with vLLM 0.6.3, FlashAttention 2, sliding window attention enabled, CUDA 12.4. Sustained over a 5-minute window after warmup.

| Precision | Single-user t/s | Batch 8 aggregate t/s | Batch 16 aggregate t/s | Batch 32 aggregate t/s |
| --- | --- | --- | --- | --- |
| FP16 | 105 | 540 | 720 | 820 (plateau) |
| FP8 E4M3 | 215 | 800 | 980 | 1080 (plateau) |
| AWQ INT4 | 240 | 820 | 1020 | 1120 (plateau) |
| GPTQ INT4 | 225 | 790 | 980 | 1070 |

FP8 is the practical winner: roughly 2x FP16 throughput with MMLU within 0.4 points of the FP16 reference, and a far simpler quality story than INT4 quantisation. AWQ INT4 wins on raw throughput but gives up 1-2 points on MT-Bench and HumanEval. Mistral 7B FP8 at 215 t/s single-stream is faster than Llama 3.1 8B at 198 t/s; at batch 32 the aggregate gap closes to 1080 vs 1100. The plateau at batch 32 is the bandwidth wall described in the GDDR6X bandwidth piece.
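
The plateau is easier to reason about per user: divide the aggregate by the batch size. A quick sketch over the FP8 row of the table above:

# Per-stream decode speed implied by the FP8 aggregate figures above
fp8_aggregate = {1: 215, 8: 800, 16: 980, 32: 1080}

for batch, aggregate in fp8_aggregate.items():
    # At batch 32 each user still sees ~34 t/s (well above reading speed)
    # while the card does ~5x the work of single-stream serving
    print(f"batch {batch:>2}: {aggregate:>4} t/s aggregate, {aggregate / batch:6.1f} t/s per stream")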

Latency profile: TTFT vs prompt length

| Prompt length | TTFT (FP8) | Inter-token (FP8) | Notes |
| --- | --- | --- | --- |
| 256 tokens | 70 ms | 4.6 ms | Chat reply |
| 1k tokens | 120 ms | 4.6 ms | Short RAG |
| 2k tokens | 200 ms | 4.7 ms | Standard RAG |
| 4k tokens | 360 ms | 4.7 ms | Long RAG |
| 8k tokens | 740 ms | 4.8 ms | Document QA |
| 16k tokens | 1500 ms | 5.0 ms | Edge of sliding window utility |
| 32k tokens | 3300 ms | 5.2 ms | Long context |

TTFT under 400 ms holds up to 4k input. The inter-token time stays nearly flat across context lengths because the sliding window caps attention KV, so the per-token attention cost does not grow with prompt length the way it does for full-attention models. This is the defining performance characteristic of Mistral 7B: long context comes nearly free at decode time.
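
To turn the table into an end-to-end latency budget, add TTFT to output length times inter-token time. A worked example using the FP8 figures above (the 300-token reply is an illustrative assumption):

# Response time = TTFT + output_tokens x inter-token latency (FP8 figures above)
ttft_ms = {2_000: 200, 4_000: 360, 8_000: 740, 32_000: 3300}
itl_ms = {2_000: 4.7, 4_000: 4.7, 8_000: 4.8, 32_000: 5.2}

output_tokens = 300  # assumed reply length
for prompt in sorted(ttft_ms):
    total_s = (ttft_ms[prompt] + output_tokens * itl_ms[prompt]) / 1000
    print(f"{prompt:>6}-token prompt -> {total_s:.2f} s for a {output_tokens}-token reply")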

Deployment options: vLLM, TGI, Ollama, llama.cpp

| Stack | Best for | FP8 | Sliding window | Function calling |
| --- | --- | --- | --- | --- |
| vLLM 0.6+ | Production OpenAI-compatible | Native | Yes | Yes (with guidance) |
| TGI | HF ecosystem | Yes (TRT-LLM backend) | Yes | Yes |
| Ollama | Single-user dev | GGUF only | Yes | Limited |
| llama.cpp server | Smallest stack, mixed CPU+GPU | GGUF only | Yes | Limited |
| mistral-inference (official) | Reference implementation | BF16 only | Yes | Native |

Production vLLM launch:

pip install "vllm>=0.6.0"

# FP8 E4M3 weights, FP8 KV cache, full 32k context with Mistral's
# default 4k sliding window; high concurrency is safe because the
# sliding window keeps per-stream KV small
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --sliding-window 4096 \
  --max-num-seqs 32 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.92 \
  --port 8000

Set --max-num-seqs 32 aggressively: the sliding window keeps the per-stream KV footprint small, so the card can absorb heavy concurrent traffic without OOM. Production sizing on a single 4090 running this configuration: ~80M output tokens/day at FP8, a comfortable 32 active streams, and p95 TTFT under 400 ms for 4k prompts. For a function-calling workload, add --guided-decoding-backend lm-format-enforcer and supply a JSON schema; Mistral 7B v0.3 reliably emits valid JSON when constrained.
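
A minimal client sketch for that constrained path, assuming the server above is running on port 8000 (the schema and prompt are illustrative; guided_json is a vLLM extension to the OpenAI API, passed through extra_body):

# Constrained JSON output against the vLLM OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city", "unit"],
}

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Extract the query: how hot is it in Leeds, in celsius?"}],
    extra_body={"guided_json": schema},  # enforced by lm-format-enforcer
)
print(resp.choices[0].message.content)  # valid JSON matching the schema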

# For smaller deployments, llama.cpp with GGUF Q4_K_M
./llama-server \
  --model mistral-7b-v0.3-q4_k_m.gguf \
  --ctx-size 32768 \
  --batch-size 256 \
  --n-gpu-layers 33 \
  --port 8001

llama.cpp single-stream throughput is ~165 t/s for Q4_K_M, well below vLLM but with a much smaller stack. The right choice for prototyping or single-user dev tools; not the right choice for production multi-user serving.
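
A quick smoke test against that server (llama-server exposes an OpenAI-compatible chat endpoint; the prompt is illustrative):

# Hit llama-server's OpenAI-compatible endpoint on the port used above
import requests

resp = requests.post(
    "http://localhost:8001/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Summarise sliding window attention in one sentence."}],
        "max_tokens": 128,
        "temperature": 0.2,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])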

Real workloads

| Workload | Concurrent active | Context | Throughput | p95 TTFT | £/M tokens |
| --- | --- | --- | --- | --- | --- |
| 200-MAU SaaS RAG (10 concurrent) | 10 streams | 8k chunks | ~720 t/s aggregate | 180 ms | £0.13 |
| Customer support bot (24 concurrent) | 24 streams | 4k history | ~960 t/s aggregate | 200 ms | £0.10 |
| Tool-call agent backend | 16 streams | 16k context | ~880 t/s aggregate | 520 ms | £0.11 |
| Document classification batch | 32 streams | 2k input, 256 out | ~1080 t/s aggregate | n/a (batch) | £0.08 |

For a 200-MAU SaaS RAG running Mistral 7B FP8 on a single 4090, the £329/month hosted cost amortises to roughly £0.13 per million tokens served. For a 12-engineer team using Mistral 7B as a fast fallback model alongside Llama 3.1 8B for higher quality, the same card serves both at ~880 t/s aggregate with FP8 weights and FP8 KV.
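
The per-token economics are simple enough to back-calculate from the hosting price (the duty cycle here is an assumption, not a measurement):

# £/M-token arithmetic for a single hosted 4090 at FP8
monthly_cost_gbp = 329
aggregate_tps = 1080      # batch 32 FP8 plateau from the throughput table
duty_cycle = 0.85         # assumed fraction of the day spent under load

tokens_per_day = aggregate_tps * 86_400 * duty_cycle      # ~79M/day, matching ~80M above
cost_per_million = monthly_cost_gbp / (tokens_per_day * 30 / 1e6)
print(f"{tokens_per_day / 1e6:.0f}M tokens/day, £{cost_per_million:.2f} per million tokens")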

vs Llama 3.1 8B head-to-head

| Metric | Mistral 7B v0.3 | Llama 3.1 8B |
| --- | --- | --- |
| Parameters | 7.25B | 8.03B |
| FP8 single-user t/s | 215 | 198 |
| FP8 batch 32 aggregate t/s | 1080 | 1100 |
| Native context | 32k (sliding 4k) | 128k (full) |
| KV at 32k context | 0.5 GB (sliding) | 2.1 GB (full) |
| MMLU | 60.8 | 68.4 |
| GSM8K | 52.1 | 84.5 |
| HumanEval | 30.5 | 62.2 |
| MT-Bench | 7.6 | 8.3 |
| Tool-calling tokens | Yes (v0.3) | Yes (v3.1) |
| Licence | Apache 2.0 | Llama 3 Community (700M MAU cap) |
| Long-range attention | Limited (4k window) | Full 128k |

Llama 3.1 8B wins on quality benchmarks across the board, by margins from about 8 points (MMLU) to more than 30 (GSM8K, HumanEval). Mistral 7B wins on raw single-stream speed (~9 per cent faster at FP8), on the cheap KV its sliding window buys at very long contexts, and on licence permissiveness (Apache 2.0 has no acceptable-use clauses). Pick Mistral 7B for high-volume, latency-critical pipelines where 7.6 MT-Bench is enough and the Apache licence matters; pick Llama 3.1 8B otherwise. See Llama 3.1 8B for the parallel writeup.

Production gotchas

  1. Sliding window cuts genuine long-range tasks. If your workload requires integrating signal across 16k+ tokens (long-form summarisation, multi-document reasoning), Llama 3.1 8B’s full attention is the safer pick.
  2. v0.3 tokenizer is not compatible with v0.1/v0.2. Migrating from older Mistral 7B versions changes token counts and may invalidate cached prompts.
  3. Function-calling tokens require the explicit prompt format. Use the [AVAILABLE_TOOLS] and [TOOL_CALLS] tags; bare JSON-in-prompt does not reliably trigger the tool path (see the sketch after this list).
  4. FP8 quality on creative writing degrades faster than on factual tasks. Validate on your eval if creative output matters.
  5. Continuous batching with sliding window is well-tested in vLLM but flaky in older TGI builds. Use TGI 2.3+ for production.
  6. The 32k context limit is a Mistral architecture choice, not a hardware one. Do not try to extend beyond 32k with RoPE scaling; quality collapses past 40k.
  7. Pre-quantised FP8 checkpoints are scarce compared with Llama. Use Neural Magic's where available; otherwise runtime FP8 calibration adds 30-60 seconds at startup.
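
For gotcha 3, a sketch of the raw v0.3 tool-call prompt layout (the tool definition is illustrative; in production, prefer the model's chat template or mistral-common over hand-building the string):

# Raw Mistral v0.3 tool-call prompt: tools go inside [AVAILABLE_TOOLS] tags
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

prompt = (
    f"[AVAILABLE_TOOLS]{json.dumps(tools)}[/AVAILABLE_TOOLS]"
    "[INST] What's the weather in Leeds? [/INST]"
)
# A tool-using completion then starts with [TOOL_CALLS] followed by JSON,
# e.g. [TOOL_CALLS][{"name": "get_weather", "arguments": {"city": "Leeds"}}]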

Verdict and when to pick Mistral 7B

Pick Mistral 7B v0.3 on a 4090 24GB when: you need maximum single-stream throughput at 7B-class quality; you want Apache 2.0 licence with no acceptable-use restrictions or MAU caps; your workload fits the 4k sliding window (RAG with small chunks, short conversational history, classification, tool-call); you serve 16-32 concurrent users where the sliding window keeps KV bounded. Skip Mistral 7B if: you need long-range integration across 16k+ tokens (move to Llama 3.1 8B or Mistral Nemo 12B); you need code generation matching GPT-4o mini (move to Llama 3.1 8B or larger); you need MMLU above 65 (move to Llama 3.1 8B at minimum, ideally 70B AWQ).

For a 12-engineer coding team, Mistral 7B is the wrong pick for the primary code model (HumanEval 30 is too low) but is the right pick for a fast secondary model handling auto-complete suggestions where latency dominates quality. For a 200-MAU SaaS RAG with chunked retrieval, Mistral 7B is the right primary choice and the £329/month hosted 4090 supports the realistic peak comfortably.

Mistral 7B at 215 t/s on UK hosting

Apache 2.0 licensed, blisteringly fast on Ada FP8, 32k context with a sliding window. UK dedicated hosting.

Order the RTX 4090 24GB

See also: Llama 3.1 8B comparison, Mistral Nemo 12B, Mistral Small 3, FP8 deployment, vLLM setup, Mistral 7B benchmark, SaaS RAG, coding assistant, 8B LLM VRAM.
