Mistral 7B is the most-deployed open-weight model in our customer base: it runs on everything from an RTX 3050 (quantised) to a 6000 Pro, supports function calling, and ships under Apache 2.0. Mistral Small 22B is the bigger sibling that has quietly become the default for teams that need more reasoning headroom without jumping to 70B-class hardware. This page is the consolidated benchmark table we use to size deployments.
For Mistral 7B, the RTX 5090 tops the aggregate charts (~1,200 tok/s at FP16, ~1,900 at FP8) and sits near the bottom of the cost-per-token table. For latency-bound single-stream workloads the RTX 5080 wins on TTFT. For Mistral Small 22B, a 24 GB card (RTX 3090 or RTX 4090) handles AWQ-INT4, the RTX 5090 just squeezes in FP8, and FP16 needs the RTX 6000 Pro.
Benchmark setup
- vLLM 0.6.3 with continuous batching enabled
- Locust 2.x driver with 50 concurrent users
- Prompt distribution: 30% 200-token, 50% 1,000-token, 20% 4,000-token
- Output capped at 256 tokens
- 10-minute warm runs, results from the steady-state window
- Ubuntu 22.04, NVIDIA driver 555.x for Blackwell, 535.x for Ampere/Ada
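The actual harness is a Locust script driving vLLM's OpenAI-compatible server, but you can approximate the prompt mix and output cap with vLLM's offline Python API. A minimal sketch (the model id and the synthetic prompts are illustrative, not our exact harness):

```python
import random
from vllm import LLM, SamplingParams

# Approximate the benchmark prompt mix: 30% ~200-token, 50% ~1,000-token,
# 20% ~4,000-token prompts. "lorem " is a crude one-token-per-repeat stand-in.
def make_prompts(n: int) -> list[str]:
    lengths = random.choices([200, 1000, 4000], weights=[0.3, 0.5, 0.2], k=n)
    return ["Summarise the following:\n" + "lorem " * length for length in lengths]

# Continuous batching is on by default in vLLM; dtype matches the FP16 rows.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3", dtype="float16")

# Output capped at 256 tokens, as in the table runs.
params = SamplingParams(max_tokens=256, temperature=0.7)

outputs = llm.generate(make_prompts(500), params)
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generated {total_tokens} tokens")
```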
Mistral 7B Instruct, FP16
The reference deployment. ~14 GB weights, fits any 16+ GB card.
| GPU | VRAM | Aggregate tok/s | Single-stream tok/s | Median TTFT | p99 TTFT |
|---|---|---|---|---|---|
| RTX 3050 6 GB | 6 GB | does not fit | — | — | — |
| RTX 4060 8 GB | 8 GB | does not fit | — | — | — |
| RTX 3060 12 GB | 12 GB | does not fit | — | — | — |
| RTX 5060 Ti 16 GB | 16 GB | 580 | 62 | 180 ms | 420 ms |
| RTX 5080 | 16 GB | 820 | 95 | 120 ms | 280 ms |
| RTX 3090 | 24 GB | 720 | 58 | 220 ms | 540 ms |
| RTX 4090 | 24 GB | 950 | 82 | 160 ms | 360 ms |
| RTX 5090 | 32 GB | 1,180 | 92 | 130 ms | 300 ms |
| RTX 6000 Pro 96 GB | 96 GB | 1,140 | 88 | 140 ms | 320 ms |
| A100 80 GB | 80 GB | 1,310 | 78 | 170 ms | 390 ms |
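The TTFT and single-stream columns are per-request measurements against the serving endpoint. If you want to sanity-check them on your own card, here is a rough sketch against a locally running vLLM OpenAI-compatible server (the endpoint URL and the chunk-per-token approximation are assumptions, not our measurement code):

```python
import time
from openai import OpenAI  # vLLM exposes an OpenAI-compatible endpoint

# Assumes a local server, e.g. started with `vllm serve mistralai/Mistral-7B-Instruct-v0.3`
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
n_chunks = 0
stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1  # one streamed chunk ≈ one token, good enough for a rough estimate

elapsed = time.perf_counter() - start
ttft = first_token_at - start
print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"single-stream: {n_chunks / (elapsed - ttft):.0f} tok/s")
```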
Mistral 7B Instruct, FP8 (Blackwell native)
Hardware FP8 on the Blackwell cards (5060 Ti / 5080 / 5090 / 6000 Pro) lands roughly 1.5–1.65× the FP16 aggregate throughput. Quality regression is <0.5% on standard evals.
| GPU | Aggregate tok/s | Single-stream tok/s | vs FP16 |
|---|---|---|---|
| RTX 5060 Ti 16 GB | 880 | 76 | +52% |
| RTX 5080 | 1,290 | 120 | +57% |
| RTX 5090 | 1,920 | 118 | +63% |
| RTX 6000 Pro 96 GB | 1,860 | 110 | +63% |
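Reproducing the FP8 rows only needs a quantisation flag: vLLM can quantise the FP16 checkpoint to FP8 at load time. A minimal sketch, again with an illustrative model id:

```python
from vllm import LLM, SamplingParams

# Online FP8 quantisation of the FP16 checkpoint. Native FP8 matmuls need
# FP8-capable tensor cores (the Blackwell cards in the table above).
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3", quantization="fp8")

out = llm.generate(["Give me one sentence on Mistral 7B."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```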
Mistral 7B Instruct, AWQ-INT4
AWQ-INT4 brings the model to ~4.5 GB. Useful for 8-12 GB cards or for stacking multiple models on one card.
| GPU | Aggregate tok/s | Notes |
|---|---|---|
| RTX 3050 6 GB | 180 | Tight, ~2-3K context max |
| RTX 4060 8 GB | 280 | Comfortable for INT4 |
| RTX 3060 12 GB | 310 | 12 GB lets you run longer context |
| RTX 5060 Ti 16 GB | 540 | Best entry-tier |
| RTX 5090 | 2,100 | Best aggregate, with KV pool to spare |
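Pre-quantised AWQ checkpoints also load with a single flag. A sketch assuming a community AWQ export (the repo name and memory settings are illustrative starting points, not tuned values):

```python
from vllm import LLM, SamplingParams

# AWQ weights are ~4.5 GB, so on 6-8 GB cards the KV cache budget is what
# actually limits you; cap context length and leave a little headroom.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # illustrative AWQ export
    quantization="awq",
    max_model_len=4096,            # keep the KV cache small on 6-8 GB cards
    gpu_memory_utilization=0.90,
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```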
Mistral Small 22B
22B parameters, ~44 GB at FP16, 22 GB at FP8, ~12 GB at AWQ-INT4. Mistral Small uses a 32K context window natively.
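Those footprints are just parameter count times bytes per weight; a back-of-envelope sizing sketch (the INT4 bytes-per-weight figure folds in quantisation scales and is a rough assumption):

```python
# Rough weight footprint: parameter count × bytes per weight. KV cache and
# runtime buffers come on top, so real headroom requirements are higher.
BYTES_PER_WEIGHT = {"fp16": 2.0, "fp8": 1.0, "awq-int4": 0.55}  # INT4 incl. scales (rough)

def weight_gb(params_billion: float, precision: str) -> float:
    return params_billion * BYTES_PER_WEIGHT[precision]

for precision in ("fp16", "fp8", "awq-int4"):
    print(f"Mistral Small 22B @ {precision}: ~{weight_gb(22, precision):.0f} GB of weights")
# -> ~44, ~22 and ~12 GB respectively, matching the figures above; add KV cache on top.
```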
| GPU | Precision | Aggregate tok/s | Notes |
|---|---|---|---|
| RTX 3090 | AWQ-INT4 | 280 | Comfortable fit |
| RTX 4090 | AWQ-INT4 | 340 | Comfortable fit |
| RTX 5090 | AWQ-INT4 | 540 | Best single-card cost-per-token |
| RTX 5090 | FP8 | — | Tight: 22 GB of weights plus KV cache only just fits in 32 GB |
| RTX 6000 Pro | FP8 | 680 | Comfortable, recommended |
| RTX 6000 Pro | FP16 | 410 | Reference quality |
Cost per 1M tokens
Calculated as monthly_price_GBP / (aggregate_tok/s × 0.60 × 60 × 60 × 24 × 30) × 1,000,000, where 0.60 is the assumed steady-state utilisation.
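The same formula as a small helper, in case you want to plug in your own pricing (the example inputs are illustrative, not a row from the table):

```python
# Cost per 1M generated tokens for a flat monthly GPU price, assuming the card
# sustains `agg_tok_s` aggregate throughput for `utilisation` of the month.
SECONDS_PER_MONTH = 60 * 60 * 24 * 30

def cost_per_million(monthly_gbp: float, agg_tok_s: float, utilisation: float = 0.60) -> float:
    tokens_per_month = agg_tok_s * utilisation * SECONDS_PER_MONTH
    return monthly_gbp / tokens_per_month * 1_000_000

# Illustrative inputs: £300/month card sustaining 1,000 tok/s aggregate.
print(f"£{cost_per_million(300, 1000):.2f} per 1M tokens")  # ~£0.19
```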
| GPU | Monthly cost | Aggregate tok/s (Mistral 7B FP8) | Cost per 1M tokens |
|---|---|---|---|
| RTX 3050 6 GB | £79 | 180 (AWQ-INT4 only) | ~£0.41 |
| RTX 5060 Ti 16 GB | £119 | 880 | £0.12 |
| RTX 5080 | £189 | 1,290 | £0.11 |
| RTX 3090 | £159 | 720 (FP16 only) | £0.16 |
| RTX 5090 | £399 | 1,920 | £0.12 |
| RTX 6000 Pro | £899 | 1,860 | £0.38 |
Lower is better. The 5080 is the cost leader at low-medium concurrency; the 5090 wins on absolute throughput.
Verdict — which card is the best Mistral host?
If your traffic profile is steady and high-concurrency, the RTX 5090 is the best Mistral 7B host we have. If you need lowest single-stream latency for a chatbot that feels instant, the RTX 5080 wins. For Mistral Small 22B, the RTX 6000 Pro is the right home if you have budget; otherwise a single RTX 5090 at AWQ-INT4 is the cheapest practical deployment.
Bottom line
For most teams choosing a Mistral host today: RTX 5090 + FP8. Cost per token within a penny of the cheapest card, plenty of VRAM headroom for KV cache and second models, mature stack. Drop to RTX 3090 if cost is the only consideration; step up to RTX 6000 Pro if you need ECC or you’re running Mistral Small at FP16.
For full GPU-by-GPU sizing across other models, see best GPU for LLM inference.