Mistral 7B v0.3 is the leanest, fastest open model still in active production use, and on the RTX 4090 24GB it is the throughput champion of its class: 215 tokens per second per stream at FP8, a comfortable 32k context, an Apache 2.0 licence, and a sliding window attention scheme that keeps KV memory bounded regardless of input length. On UK dedicated GPU hosting it is the right pick when raw speed and a permissive licence matter more than the extra benchmark points Llama 3.1 8B brings. This guide walks through the VRAM accounting, the sliding window math that makes long context cheap, throughput by batch and quant, the latency profile, deployment options across vLLM, TGI, Ollama and llama.cpp, real workload sizings, and the head-to-head with Llama 3.1 8B.
Contents
- Why Mistral 7B v0.3 is still relevant
- VRAM accounting and the sliding window math
- Throughput by batch and quant
- Latency profile: TTFT vs prompt length
- Deployment options: vLLM, TGI, Ollama, llama.cpp
- Real workloads
- vs Llama 3.1 8B head-to-head
- Production gotchas
- Verdict and when to pick Mistral 7B
Why Mistral 7B v0.3 is still relevant
Mistral 7B v0.3 (Mistral AI, May 2024 update) has 7.25B parameters, 32 layers, 8 KV heads (grouped-query attention), and a 32k effective context with sliding window attention at a 4k window. The v0.3 update added an extended tokenizer (32,768-token vocabulary, up from 32,000, with reserved tokens for tool use), a tokenizer fix for code, and dedicated function-calling tokens. The Apache 2.0 licence is the headline distinction from Llama 3.1: there are no acceptable-use clauses and no MAU cap. Combined, those properties make Mistral 7B the canonical “fast and free” model.
For tooling, RAG, customer support and “small fast generator behind a product surface” use cases it is still the model to beat in its tier. It loses to Llama 3.1 8B on most benchmarks (62.5 MMLU vs 68.4, 30.5 HumanEval vs 62.2) but wins on raw decode throughput (215 t/s vs 198 t/s at FP8), on KV cost at long context (the sliding window caps it), on licence permissiveness, and on the breadth of community fine-tunes that ship for it. For a workload where the answer quality from a 62.5 MMLU model is acceptable and the throughput matters, Mistral 7B is the right call.
VRAM accounting and the sliding window math
Mistral 7B v0.3 has 7.25B parameters. The per-token KV cost for full attention would be 2 (K, V) x 8 (KV heads) x 128 (head dim) x 32 (layers) x 2 bytes (FP16) = 131 KB per token, or 65.5 KB at FP8. With sliding window attention at a 4k window, the KV is capped at 4k tokens regardless of input length: a 32k input still only stores 4k tokens of KV per layer. At FP16 the full-attention KV at 32k would be 4.2 GB; the sliding window keeps it at roughly 0.5 GB.
| Component | FP16 | FP8 | AWQ INT4 |
|---|---|---|---|
| Weights | 14.5 GB | 7.25 GB | 4.1 GB |
| Activations + workspace | 0.7 GB | 0.7 GB | 0.7 GB |
| CUDA + driver overhead | 0.6 GB | 0.6 GB | 0.6 GB |
| KV per token (FP8 KV) | 65.5 KB | 65.5 KB | 65.5 KB |
| KV @ 32k FP8 (sliding 4k window) | 0.5 GB | 0.5 GB | 0.5 GB |
| KV @ 32k FP8 (full attention) | 2.1 GB | 2.1 GB | 2.1 GB |
| Total @ 32k sliding, 1 stream | 16.3 GB | 9.0 GB | 6.0 GB |
| Total @ 32k sliding, 32 streams (FP8) | doesn’t fit | 24.5 GB (OOM) | 21.7 GB |
| Total @ 32k sliding, 16 streams (FP8) | doesn’t fit | 16.5 GB | 13.4 GB |
The sliding window attention caps KV growth: a 4k window means you never store more than 4k tokens of KV per layer regardless of input length. That is why 32k context is so cheap on Mistral compared to a vanilla full-attention 8B model. It also has a quality cost: the model can only attend to the most recent 4k tokens at each layer, so genuinely long-range dependencies lose fidelity. For RAG with chunked retrieval where each retrieval is small, this is invisible; for tasks that genuinely need to integrate signal across 32k tokens, Llama 3.1 8B with full attention is the better pick.
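To sanity-check these figures, the short Python sketch below reproduces the per-token KV cost and the sliding-window cap from the architecture numbers above. It is back-of-envelope accounting only, not an allocator-accurate measurement; real deployments add paging and fragmentation overhead on top.
# KV-cache back-of-envelope for Mistral 7B v0.3 (32 layers, 8 KV heads, head dim 128)
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BYTES = {"fp16": 2, "fp8": 1}

def kv_per_token(dtype):
    # K and V across every layer: 2 * kv_heads * head_dim * layers * bytes per element
    return 2 * KV_HEADS * HEAD_DIM * LAYERS * BYTES[dtype]

def kv_per_stream(context_len, dtype, window=4096):
    # Sliding window attention stores at most `window` tokens of KV per layer
    return min(context_len, window) * kv_per_token(dtype)

print(f"KV per token, FP16: {kv_per_token('fp16') / 1e3:.0f} KB")                          # ~131 KB
print(f"KV per token, FP8:  {kv_per_token('fp8') / 1e3:.1f} KB")                           # ~65.5 KB
print(f"32k stream, full attention, FP16: {32768 * kv_per_token('fp16') / 1e9:.1f} GB")    # ~4.3 GB
print(f"32k stream, 4k sliding window, FP16: {kv_per_stream(32768, 'fp16') / 1e9:.2f} GB") # ~0.54 GB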
Throughput by batch and quant
Measured on a 4090 24GB with vLLM 0.6.3, FlashAttention 2, sliding window attention enabled, CUDA 12.4. Sustained over a 5-minute window after warmup.
| Precision | Single-user t/s | Batch 8 t/s aggregate | Batch 16 t/s aggregate | Batch 32 t/s aggregate |
|---|---|---|---|---|
| FP16 | 105 | 540 | 720 | 820 (plateau) |
| FP8 E4M3 | 215 | 800 | 980 | 1080 (plateau) |
| AWQ INT4 | 240 | 820 | 1020 | 1120 (plateau) |
| GPTQ INT4 | 225 | 790 | 980 | 1070 |
FP8 is the practical winner: ~2x FP16 throughput with MMLU within 0.4 points of the FP16 reference, and a simpler quantisation path than INT4 (no calibration dataset to curate). AWQ INT4 wins on raw throughput but loses 1-2 points on MT-Bench and HumanEval. Mistral 7B FP8 at 215 t/s single-stream is faster than Llama 3.1 8B at 198 t/s; at batch 32 the aggregate gap closes to 1080 vs 1100. The plateau at batch 32 is the bandwidth wall described in the GDDR6X bandwidth piece.
Latency profile: TTFT vs prompt length
| Prompt length | TTFT (FP8) | Inter-token (FP8) | Notes |
|---|---|---|---|
| 256 tokens | 70 ms | 4.6 ms | Chat reply |
| 1k tokens | 120 ms | 4.6 ms | Short RAG |
| 2k tokens | 200 ms | 4.7 ms | Standard RAG |
| 4k tokens | 360 ms | 4.7 ms | Long RAG |
| 8k tokens | 740 ms | 4.8 ms | Document QA |
| 16k tokens | 1500 ms | 5.0 ms | Edge of sliding window utility |
| 32k tokens | 3300 ms | 5.2 ms | Long context |
TTFT under 400 ms holds up to 4k input. The inter-token time stays nearly flat across context lengths because the sliding window caps attention KV, so the per-token attention cost does not grow with prompt length the way it does for full-attention models. This is the defining performance characteristic of Mistral 7B: long context comes nearly free at decode time.
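To reproduce the TTFT and inter-token figures on your own card, a streaming request against any OpenAI-compatible endpoint is enough. The sketch below assumes the vLLM server from the deployment section further down is listening on localhost:8000; it times delta arrivals, which track tokens closely but are not exactly one token each.
import json
import time
import requests

url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Summarise sliding window attention in two sentences."}],
    "max_tokens": 128,
    "stream": True,
}

start = time.perf_counter()
arrivals = []
with requests.post(url, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue  # skip keep-alives and blank SSE lines
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        if chunk["choices"][0]["delta"].get("content"):
            arrivals.append(time.perf_counter())

ttft = arrivals[0] - start
gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
print(f"TTFT: {ttft * 1000:.0f} ms, mean inter-chunk gap: {1000 * sum(gaps) / max(len(gaps), 1):.1f} ms")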
Deployment options: vLLM, TGI, Ollama, llama.cpp
| Stack | Best for | FP8 | Sliding window | Function calling |
|---|---|---|---|---|
| vLLM 0.6+ | Production OpenAI-compatible | Native | Yes | Yes (with guidance) |
| TGI | HF ecosystem | Yes (TRT-LLM backend) | Yes | Yes |
| Ollama | Single-user dev | GGUF only | Yes | Limited |
| llama.cpp server | Smallest stack, mixed CPU+GPU | GGUF only | Yes | Limited |
| mistral-inference (official) | Reference implementation | BF16 only | Yes | Native |
Production vLLM launch:
pip install "vllm>=0.6.0"
# FP8 E4M3 weights with an FP8 KV cache and the full 32k context; the sliding window
# comes from the model config rather than a CLI flag, and the small per-stream KV
# is what makes the high --max-num-seqs setting safe.
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --gpu-memory-utilization 0.92 \
    --port 8000
Set --max-num-seqs 32 aggressively: the small KV footprint thanks to the sliding window means you can serve heavy concurrent traffic without OOM. Production sizing on a single 4090 running this configuration: ~80M output tokens/day at FP8, comfortable 32 active streams, p95 TTFT under 250 ms for 4k prompts. For a function-calling workload, add --guided-decoding-backend lm-format-enforcer and supply a JSON schema; Mistral 7B v0.3 reliably emits valid JSON when constrained.
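As a quick illustration of the constrained path, the request below sends a chat completion to the server launched above with vLLM's guided_json extra parameter enforcing a schema during decoding. The schema, tool names and prompt are made up for the example; only the guided_json field itself is vLLM-specific behaviour.
import requests

schema = {
    "type": "object",
    "properties": {
        "tool": {"type": "string", "enum": ["search_orders", "escalate"]},
        "arguments": {"type": "object"},
    },
    "required": ["tool", "arguments"],
}

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": "Find order 48213 and summarise its status."}],
        "max_tokens": 256,
        "guided_json": schema,  # vLLM extension: constrain decoding to this JSON schema
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])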
# For smaller deployments, llama.cpp with GGUF Q4_K_M
./llama-server \
--model mistral-7b-v0.3-q4_k_m.gguf \
--ctx-size 32768 \
--batch-size 256 \
--n-gpu-layers 33 \
--port 8001
llama.cpp single-stream throughput is ~165 t/s for Q4_K_M, well below vLLM but with a much smaller stack. The right choice for prototyping or single-user dev tools; not the right choice for production multi-user serving.
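A minimal smoke test against that llama-server instance, hitting its native /completion endpoint on the port configured above (prompt text is illustrative):
import requests

resp = requests.post(
    "http://localhost:8001/completion",
    json={"prompt": "Explain grouped-query attention in one sentence.", "n_predict": 64},
    timeout=60,
)
print(resp.json()["content"])  # generated text from the llama.cpp server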
Real workloads
| Workload | Concurrent active | Context | Throughput | p95 TTFT | £/M tokens |
|---|---|---|---|---|---|
| 200-MAU SaaS RAG (10 concurrent) | 10 streams | 8k chunks | ~720 t/s aggregate | 180 ms | £0.13 |
| Customer support bot (24 conc.) | 24 streams | 4k history | ~960 t/s aggregate | 200 ms | £0.10 |
| Tool-call agent backend | 16 streams | 16k context | ~880 t/s aggregate | 520 ms | £0.11 |
| Document classification batch | 32 streams | 2k input, 256 out | ~1080 t/s aggregate | n/a (batch) | £0.08 |
For a 200-MAU SaaS RAG running Mistral 7B FP8 on a single 4090, the £329/month hosted cost amortises to roughly £0.13 per million tokens served. For a 12-engineer team using Mistral 7B as a fast fallback model alongside Llama 3.1 8B for higher quality, the same card serves both at ~880 t/s aggregate with FP8 weights and FP8 KV.
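The £/M figure is nothing more exotic than a flat monthly card cost divided by delivered output tokens; the sketch below reproduces the SaaS RAG number using the ~80M output tokens/day sizing from the vLLM section, and lower utilisation raises the per-token cost proportionally.
# Amortising a flat monthly hosting cost over delivered output tokens
monthly_cost_gbp = 329.0        # hosted 4090, per the plan referenced in this guide
output_tokens_per_day = 80e6    # sustained FP8 output at realistic concurrency (see vLLM sizing above)
tokens_per_month = output_tokens_per_day * 30
print(f"£{monthly_cost_gbp / (tokens_per_month / 1e6):.3f} per million output tokens")  # ≈ £0.137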
vs Llama 3.1 8B head-to-head
| Metric | Mistral 7B v0.3 | Llama 3.1 8B |
|---|---|---|
| Parameters | 7.25B | 8.03B |
| FP8 single-user t/s | 215 | 198 |
| FP8 batch 32 aggregate t/s | 1080 | 1100 |
| Native context | 32k (sliding 4k) | 128k (full) |
| KV at 32k context | 0.5 GB (sliding) | 2.1 GB (full) |
| MMLU | 62.5 | 68.4 |
| GSM8K | 52.1 | 84.5 |
| HumanEval | 30.5 | 62.2 |
| MT-Bench | 7.6 | 8.3 |
| Tool-calling tokens | Yes (v0.3) | Yes (v3.1) |
| Licence | Apache 2.0 | Llama 3 Community (700M MAU cap) |
| Long-range attention | Limited (4k window) | Full 128k |
Llama 3.1 8B wins on quality benchmarks across the board, often by 6-30 points. Mistral 7B wins on raw single-stream speed (~9 percent faster at FP8), on the simplicity of its sliding window for very long contexts, and on licence permissiveness (Apache 2.0 has no acceptable-use clauses). Pick Mistral 7B for high-volume latency-critical pipelines where 7.6 MT-Bench is enough and the Apache licence matters; pick Llama 3.1 8B otherwise. See Llama 3.1 8B for the parallel writeup.
Production gotchas
- Sliding window cuts genuine long-range tasks. If your workload requires integrating signal across 16k+ tokens (long-form summarisation, multi-document reasoning), Llama 3.1 8B’s full attention is the safer pick.
- v0.3 tokenizer is not compatible with v0.1/v0.2. Migrating from older Mistral 7B versions changes token counts and may invalidate cached prompts.
- Function-calling tokens require the explicit prompt format. Use the [AVAILABLE_TOOLS] and [TOOL_CALLS] tags; bare JSON-in-prompt does not always trigger the tool path (see the sketch after this list).
- FP8 quality on creative writing degrades faster than on factual tasks. Validate on your own eval if creative output matters.
- Continuous batching with sliding window is well-tested in vLLM but flaky in older TGI builds. Use TGI 2.3+ for production.
- The 32k context limit is a Mistral architecture choice, not a hardware one. Do not try to extend beyond 32k with RoPE scaling; quality collapses past 40k.
- Pre-quantised FP8 checkpoints are scarcer than for Llama. Use neuralmagic's where available; otherwise runtime calibration adds 30-60 seconds at startup.
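For the function-calling format gotcha above, the least error-prone route is to let the checkpoint's chat template emit the [AVAILABLE_TOOLS]/[TOOL_CALLS] framing rather than hand-writing the tags. The sketch below assumes a recent transformers release whose apply_chat_template accepts a tools argument and a tokenizer whose template handles tools; the tool definition itself is illustrative.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of an order by id",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

messages = [{"role": "user", "content": "Where is order 48213?"}]
prompt = tok.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False)
print(prompt)  # renders the [AVAILABLE_TOOLS] ... [/AVAILABLE_TOOLS][INST] ... [/INST] framing when the template supports tools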
Verdict and when to pick Mistral 7B
Pick Mistral 7B v0.3 on a 4090 24GB when: you need maximum single-stream throughput at 7B-class quality; you want Apache 2.0 licence with no acceptable-use restrictions or MAU caps; your workload fits the 4k sliding window (RAG with small chunks, short conversational history, classification, tool-call); you serve 16-32 concurrent users where the sliding window keeps KV bounded. Skip Mistral 7B if: you need long-range integration across 16k+ tokens (move to Llama 3.1 8B or Mistral Nemo 12B); you need code generation matching GPT-4o mini (move to Llama 3.1 8B or larger); you need MMLU above 65 (move to Llama 3.1 8B at minimum, ideally 70B AWQ).
For a 12-engineer coding team, Mistral 7B is the wrong pick for the primary code model (HumanEval 30 is too low) but is the right pick for a fast secondary model handling auto-complete suggestions where latency dominates quality. For a 200-MAU SaaS RAG with chunked retrieval, Mistral 7B is the right primary choice and the £329/month hosted 4090 supports the realistic peak comfortably.
Mistral 7B at 215 t/s on UK hosting
Apache 2.0 licensed, blisteringly fast on Ada FP8, 32k context with sliding window attention. UK dedicated hosting.
Order the RTX 4090 24GB
See also: Llama 3.1 8B comparison, Mistral Nemo 12B, Mistral Small 3, FP8 deployment, vLLM setup, Mistral 7B benchmark, SaaS RAG, coding assistant, 8B LLM VRAM.