Mistral 7B v0.3 is the leanest, fastest open model still in active production use, and on the RTX 4090 24GB it is the throughput champion of its class: 215 tokens per second per stream at FP8, a comfortable 32k context, an Apache 2.0 licence, and a sliding window attention scheme that keeps KV memory bounded regardless of input length. On UK dedicated GPU hosting it is the right pick when raw speed and a permissive licence matter more than the extra benchmark points Llama 3.1 8B brings. This guide walks through the VRAM accounting, the sliding window math that makes long context cheap, throughput by batch and quant, the latency profile, deployment options across vLLM, TGI, Ollama and llama.cpp, real workload sizings, and the head-to-head with Llama 3.1 8B.
Contents
- Why Mistral 7B v0.3 is still relevant
- VRAM accounting and the sliding window math
- Throughput by batch and quant
- Latency profile: TTFT vs prompt length
- Deployment options: vLLM, TGI, Ollama, llama.cpp
- Real workloads
- vs Llama 3.1 8B head-to-head
- Production gotchas
- Verdict and when to pick Mistral 7B
Why Mistral 7B v0.3 is still relevant
Mistral 7B v0.3 (Mistral AI, May 2024 update) has 7.25B parameters, 32 layers, 8 KV heads (grouped-query attention), and a 32k effective context with sliding window attention at a 4k window. The v0.3 update added an extended tokenizer (32,768-token vocabulary, up from 32,000, with reserved tokens for tool use), a tokenizer fix for code, and dedicated function-calling tokens. The Apache 2.0 licence is the headline distinction from Llama 3.1: there are no acceptable-use clauses and no MAU cap. Combined, those properties make Mistral 7B the canonical “fast and free” model.
For tooling, RAG, customer support and “small fast generator behind a product surface” use cases it is still the model to beat in its tier. It loses to Llama 3.1 8B on most benchmarks (62.5 MMLU vs 68.4, 30.5 HumanEval vs 62.2) but wins on raw decode throughput (215 t/s vs 198 t/s at FP8), on KV cost at long context (the sliding window caps it), on licence permissiveness, and on the breadth of community fine-tunes that ship for it. For a workload where the answer quality from a 62.5 MMLU model is acceptable and the throughput matters, Mistral 7B is the right call.
VRAM accounting and the sliding window math
Mistral 7B v0.3 has 7.25B parameters. The per-token KV cost for full attention would be 2 (K, V) x 8 (KV heads) x 128 (head dim) x 32 (layers) x 2 bytes (FP16) = 131 KB per token, or 65.5 KB at FP8. With sliding window attention at a 4k window, the KV is capped at 4k tokens regardless of input length: a 32k input still only stores 4k tokens of KV per layer. At FP16 the full-attention KV at 32k would be 4.2 GB; the sliding window keeps it at roughly 0.5 GB.
| Component | FP16 | FP8 | AWQ INT4 |
|---|---|---|---|
| Weights | 14.5 GB | 7.25 GB | 4.1 GB |
| Activations + workspace | 0.7 GB | 0.7 GB | 0.7 GB |
| CUDA + driver overhead | 0.6 GB | 0.6 GB | 0.6 GB |
| KV per token (FP8 KV) | 65.5 KB | 65.5 KB | 65.5 KB |
| KV @ 32k FP8 (sliding 4k window) | 0.5 GB | 0.5 GB | 0.5 GB |
| KV @ 32k FP8 (full attention) | 2.1 GB | 2.1 GB | 2.1 GB |
| Total @ 32k sliding, 1 stream | 16.3 GB | 9.0 GB | 6.0 GB |
| Total @ 32k sliding, 32 streams (FP8) | doesn’t fit | 24.5 GB (OOM) | 21.7 GB |
| Total @ 32k sliding, 16 streams (FP8) | doesn’t fit | 16.5 GB | 13.4 GB |
The sliding window attention caps KV growth: a 4k window means you never store more than 4k tokens of KV per layer regardless of input length. That is why 32k context is so cheap on Mistral compared to a vanilla full-attention 8B model. It also has a quality cost: the model can only attend to the most recent 4k tokens at each layer, so genuinely long-range dependencies lose fidelity. For RAG with chunked retrieval where each retrieval is small, this is invisible; for tasks that genuinely need to integrate signal across 32k tokens, Llama 3.1 8B with full attention is the better pick.
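To sanity-check these figures, the short Python sketch below reproduces the per-token KV cost and the sliding-window cap from the architecture numbers above. It is back-of-envelope accounting only, not an allocator-accurate measurement; real deployments add paging and fragmentation overhead on top.
# KV-cache back-of-envelope for Mistral 7B v0.3 (32 layers, 8 KV heads, head dim 128)
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BYTES = {"fp16": 2, "fp8": 1}

def kv_per_token(dtype):
    # K and V across every layer: 2 * kv_heads * head_dim * layers * bytes per element
    return 2 * KV_HEADS * HEAD_DIM * LAYERS * BYTES[dtype]

def kv_per_stream(context_len, dtype, window=4096):
    # Sliding window attention stores at most `window` tokens of KV per layer
    return min(context_len, window) * kv_per_token(dtype)

print(f"KV per token, FP16: {kv_per_token('fp16') / 1e3:.0f} KB")                          # ~131 KB
print(f"KV per token, FP8:  {kv_per_token('fp8') / 1e3:.1f} KB")                           # ~65.5 KB
print(f"32k stream, full attention, FP16: {32768 * kv_per_token('fp16') / 1e9:.1f} GB")    # ~4.3 GB
print(f"32k stream, 4k sliding window, FP16: {kv_per_stream(32768, 'fp16') / 1e9:.2f} GB") # ~0.54 GB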
Throughput by batch and quant
Measured on a 4090 24GB with vLLM 0.6.3, FlashAttention 2, sliding window attention enabled, CUDA 12.4. Sustained over a 5-minute window after warmup.
| Precision | Single-user t/s | Batch 8 t/s aggregate | Batch 16 t/s aggregate | Batch 32 t/s aggregate |
|---|---|---|---|---|
| FP16 | 105 | 540 | 720 | 820 (plateau) |
| FP8 E4M3 | 215 | 800 | 980 | 1080 (plateau) |
| AWQ INT4 | 240 | 820 | 1020 | 1120 (plateau) |
| GPTQ INT4 | 225 | 790 | 980 | 1070 |
FP8 is the practical winner: ~2x FP16 throughput with MMLU within 0.4 points of the FP16 reference, and a simpler quantisation path than INT4 (no calibration dataset to curate). AWQ INT4 wins on raw throughput but loses 1-2 points on MT-Bench and HumanEval. Mistral 7B FP8 at 215 t/s single-stream is faster than Llama 3.1 8B at 198 t/s; at batch 32 the aggregate gap closes to 1080 vs 1100. The plateau at batch 32 is the bandwidth wall described in the GDDR6X bandwidth piece.
Latency profile: TTFT vs prompt length
| Prompt length | TTFT (FP8) | Inter-token (FP8) | Notes |
|---|---|---|---|
| 256 tokens | 70 ms | 4.6 ms | Chat reply |
| 1k tokens | 120 ms | 4.6 ms | Short RAG |
| 2k tokens | 200 ms | 4.7 ms | Standard RAG |
| 4k tokens | 360 ms | 4.7 ms | Long RAG |
| 8k tokens | 740 ms | 4.8 ms | Document QA |
| 16k tokens | 1500 ms | 5.0 ms | Edge of sliding window utility |
| 32k tokens | 3300 ms | 5.2 ms | Long context |
TTFT under 400 ms holds up to 4k input. The inter-token time stays nearly flat across context lengths because the sliding window caps attention KV, so the per-token attention cost does not grow with prompt length the way it does for full-attention models. This is the defining performance characteristic of Mistral 7B: long context comes nearly free at decode time.
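To reproduce the TTFT and inter-token figures on your own card, a streaming request against any OpenAI-compatible endpoint is enough. The sketch below assumes the vLLM server from the deployment section further down is listening on localhost:8000; it times delta arrivals, which track tokens closely but are not exactly one token each.
import json
import time
import requests

url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Summarise sliding window attention in two sentences."}],
    "max_tokens": 128,
    "stream": True,
}

start = time.perf_counter()
arrivals = []
with requests.post(url, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue  # skip keep-alives and blank SSE lines
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        if chunk["choices"][0]["delta"].get("content"):
            arrivals.append(time.perf_counter())

ttft = arrivals[0] - start
gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
print(f"TTFT: {ttft * 1000:.0f} ms, mean inter-chunk gap: {1000 * sum(gaps) / max(len(gaps), 1):.1f} ms")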
Deployment options: vLLM, TGI, Ollama, llama.cpp
| Stack | Best for | FP8 | Sliding window | Function calling |
|---|---|---|---|---|
| vLLM 0.6+ | Production OpenAI-compatible | Native | Yes | Yes (with guidance) |
| TGI | HF ecosystem | Yes (TRT-LLM backend) | Yes | Yes |
| Ollama | Single-user dev | GGUF only | Yes | Limited |
| llama.cpp server | Smallest stack, mixed CPU+GPU | GGUF only | Yes | Limited |
| mistral-inference (official) | Reference implementation | BF16 only | Yes | Native |
Production vLLM launch:
pip install "vllm>=0.6.0"
# FP8 E4M3 weights with an FP8 KV cache and the full 32k context; the sliding window
# comes from the model config rather than a CLI flag, and the small per-stream KV
# is what makes the high --max-num-seqs setting safe.
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --gpu-memory-utilization 0.92 \
    --port 8000
Set --max-num-seqs 32 aggressively: the small KV footprint thanks to the sliding window means you can serve heavy concurrent traffic without OOM. Production sizing on a single 4090 running this configuration: ~80M output tokens/day at FP8, comfortable 32 active streams, p95 TTFT under 250 ms for 4k prompts. For a function-calling workload, add --guided-decoding-backend lm-format-enforcer and supply a JSON schema; Mistral 7B v0.3 reliably emits valid JSON when constrained.
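As a quick illustration of the constrained path, the request below sends a chat completion to the server launched above with vLLM's guided_json extra parameter enforcing a schema during decoding. The schema, tool names and prompt are made up for the example; only the guided_json field itself is vLLM-specific behaviour.
import requests

schema = {
    "type": "object",
    "properties": {
        "tool": {"type": "string", "enum": ["search_orders", "escalate"]},
        "arguments": {"type": "object"},
    },
    "required": ["tool", "arguments"],
}

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": "Find order 48213 and summarise its status."}],
        "max_tokens": 256,
        "guided_json": schema,  # vLLM extension: constrain decoding to this JSON schema
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])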
# For smaller deployments, llama.cpp with GGUF Q4_K_M
./llama-server \
--model mistral-7b-v0.3-q4_k_m.gguf \
--ctx-size 32768 \
--batch-size 256 \
--n-gpu-layers 33 \
--port 8001
llama.cpp single-stream throughput is ~165 t/s for Q4_K_M, well below vLLM but with a much smaller stack. The right choice for prototyping or single-user dev tools; not the right choice for production multi-user serving.
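A minimal smoke test against that llama-server instance, hitting its native /completion endpoint on the port configured above (prompt text is illustrative):
import requests

resp = requests.post(
    "http://localhost:8001/completion",
    json={"prompt": "Explain grouped-query attention in one sentence.", "n_predict": 64},
    timeout=60,
)
print(resp.json()["content"])  # generated text from the llama.cpp server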
Real workloads
| Workload | Concurrent active | Context | Throughput | p95 TTFT | £/M tokens |
|---|---|---|---|---|---|
| 200-MAU SaaS RAG (10 concurrent) | 10 streams | 8k chunks | ~720 t/s aggregate | 180 ms | £0.13 |
| Customer support bot (24 conc.) | 24 streams | 4k history | ~960 t/s aggregate | 200 ms | £0.10 |
| Tool-call agent backend | 16 streams | 16k context | ~880 t/s aggregate | 520 ms | £0.11 |
| Document classification batch | 32 streams | 2k input, 256 out | ~1080 t/s aggregate | n/a (batch) | £0.08 |
For a 200-MAU SaaS RAG running Mistral 7B FP8 on a single 4090, the £329/month hosted cost amortises to roughly £0.13 per million tokens served. For a 12-engineer team using Mistral 7B as a fast fallback model alongside Llama 3.1 8B for higher quality, the same card serves both at ~880 t/s aggregate with FP8 weights and FP8 KV.
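The £/M figure is nothing more exotic than a flat monthly card cost divided by delivered output tokens; the sketch below reproduces the SaaS RAG number using the ~80M output tokens/day sizing from the vLLM section, and lower utilisation raises the per-token cost proportionally.
# Amortising a flat monthly hosting cost over delivered output tokens
monthly_cost_gbp = 329.0        # hosted 4090, per the plan referenced in this guide
output_tokens_per_day = 80e6    # sustained FP8 output at realistic concurrency (see vLLM sizing above)
tokens_per_month = output_tokens_per_day * 30
print(f"£{monthly_cost_gbp / (tokens_per_month / 1e6):.3f} per million output tokens")  # ≈ £0.137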
vs Llama 3.1 8B head-to-head
| Metric | Mistral 7B v0.3 | Llama 3.1 8B |
|---|---|---|
| Parameters | 7.25B | 8.03B |
| FP8 single-user t/s | 215 | 198 |
| FP8 batch 32 aggregate t/s | 1080 | 1100 |
| Native context | 32k (sliding 4k) | 128k (full) |
| KV at 32k context | 0.5 GB (sliding) | 2.1 GB (full) |
| MMLU | 62.5 | 68.4 |
| GSM8K | 52.1 | 84.5 |
| HumanEval | 30.5 | 62.2 |
| MT-Bench | 7.6 | 8.3 |
| Tool-calling tokens | Yes (v0.3) | Yes (v3.1) |
| Licence | Apache 2.0 | Llama 3 Community (700M MAU cap) |
| Long-range attention | Limited (4k window) | Full 128k |
Llama 3.1 8B wins on quality benchmarks across the board, often by 6-30 points. Mistral 7B wins on raw single-stream speed (~9 percent faster at FP8), on the simplicity of its sliding window for very long contexts, and on licence permissiveness (Apache 2.0 has no acceptable-use clauses). Pick Mistral 7B for high-volume latency-critical pipelines where 7.6 MT-Bench is enough and the Apache licence matters; pick Llama 3.1 8B otherwise. See Llama 3.1 8B for the parallel writeup.
Production gotchas
- Sliding window cuts genuine long-range tasks. If your workload requires integrating signal across 16k+ tokens (long-form summarisation, multi-document reasoning), Llama 3.1 8B’s full attention is the safer pick.
- v0.3 tokenizer is not compatible with v0.1/v0.2. Migrating from older Mistral 7B versions changes token counts and may invalidate cached prompts.
- Function-calling tokens require the explicit prompt format. Use the [AVAILABLE_TOOLS] and [TOOL_CALLS] tags; bare JSON-in-prompt does not always trigger the tool path (see the sketch after this list).
- FP8 quality on creative writing degrades faster than on factual tasks. Validate on your own eval if creative output matters.
- Continuous batching with sliding window is well-tested in vLLM but flaky in older TGI builds. Use TGI 2.3+ for production.
- The 32k context limit is a Mistral architecture choice, not a hardware one. Do not try to extend beyond 32k with RoPE scaling; quality collapses past 40k.
- Pre-quantised FP8 checkpoints are scarcer than for Llama. Use neuralmagic's where available; otherwise runtime calibration adds 30-60 seconds at startup.
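For the function-calling format gotcha above, the least error-prone route is to let the checkpoint's chat template emit the [AVAILABLE_TOOLS]/[TOOL_CALLS] framing rather than hand-writing the tags. The sketch below assumes a recent transformers release whose apply_chat_template accepts a tools argument and a tokenizer whose template handles tools; the tool definition itself is illustrative.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of an order by id",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

messages = [{"role": "user", "content": "Where is order 48213?"}]
prompt = tok.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False)
print(prompt)  # renders the [AVAILABLE_TOOLS] ... [/AVAILABLE_TOOLS][INST] ... [/INST] framing when the template supports tools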
Verdict and when to pick Mistral 7B
Pick Mistral 7B v0.3 on a 4090 24GB when: you need maximum single-stream throughput at 7B-class quality; you want Apache 2.0 licence with no acceptable-use restrictions or MAU caps; your workload fits the 4k sliding window (RAG with small chunks, short conversational history, classification, tool-call); you serve 16-32 concurrent users where the sliding window keeps KV bounded. Skip Mistral 7B if: you need long-range integration across 16k+ tokens (move to Llama 3.1 8B or Mistral Nemo 12B); you need code generation matching GPT-4o mini (move to Llama 3.1 8B or larger); you need MMLU above 65 (move to Llama 3.1 8B at minimum, ideally 70B AWQ).
For a 12-engineer coding team, Mistral 7B is the wrong pick for the primary code model (HumanEval 30 is too low) but is the right pick for a fast secondary model handling auto-complete suggestions where latency dominates quality. For a 200-MAU SaaS RAG with chunked retrieval, Mistral 7B is the right primary choice and the £329/month hosted 4090 supports the realistic peak comfortably.
Mistral 7B at 215 t/s on UK hosting
Apache 2.0 licensed, blisteringly fast on Ada FP8, 32k context with sliding window attention. UK dedicated hosting.
Order the RTX 4090 24GB
See also: Llama 3.1 8B comparison, Mistral Nemo 12B, Mistral Small 3, FP8 deployment, vLLM setup, Mistral 7B benchmark, SaaS RAG, coding assistant, 8B LLM VRAM.