
Mistral 7B and Mistral Small 22B Benchmarks Across Every GPU We Host

Real tokens-per-second, time-to-first-token and cost-per-million-tokens numbers for Mistral 7B Instruct and Mistral Small 22B on every GPU in the GigaGPU catalogue, at FP16, FP8 and AWQ-INT4.

Mistral 7B is the most-deployed open-weight model in our customer base: it fits everything from an RTX 3050 (quantised) to an RTX 6000 Pro, supports function calling, and ships under Apache 2.0. Mistral Small 22B is the bigger sibling that's quietly become the default for teams that need more reasoning headroom without jumping to 70B-class hardware. This page is the consolidated benchmark table we use to size deployments.

TL;DR

For Mistral 7B at FP16, the RTX 5090 tops the chart at ~1,180 tok/s aggregate; on cost per million tokens the RTX 5080 leads, and it also wins on TTFT for latency-bound single-stream workloads. Mistral Small 22B needs ~44 GB at FP16 (RTX 6000 Pro only); a 24 GB RTX 3090 or RTX 4090 works at AWQ-INT4, and the RTX 5090 (tight) or RTX 6000 Pro handles FP8.

Benchmark setup

  • vLLM 0.6.3 with continuous batching enabled
  • Locust 2.x driver with 50 concurrent users
  • Prompt distribution: 30% 200-token, 50% 1,000-token, 20% 4,000-token (load shape sketched below)
  • Output capped at 256 tokens
  • 10-minute warm runs, results from the steady-state window
  • Ubuntu 22.04, NVIDIA driver 555.x for Blackwell, 535.x for Ampere/Ada
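
A minimal sketch of that load shape, assuming vLLM's OpenAI-compatible /v1/completions endpoint; the model id, host and the repeated-word prompt bodies are placeholders, not our exact harness:

```python
import random
from locust import HttpUser, task

# (approx prompt tokens, weight) per the 30/50/20 mix above
PROMPT_MIX = [(200, 0.30), (1_000, 0.50), (4_000, 0.20)]

def pick_prompt_len() -> int:
    lengths, weights = zip(*PROMPT_MIX)
    return random.choices(lengths, weights=weights, k=1)[0]

class MistralUser(HttpUser):
    host = "http://localhost:8000"  # placeholder vLLM endpoint

    @task
    def complete(self) -> None:
        # Crude stand-in prompt: roughly one token per repeated word.
        prompt = "benchmark " * pick_prompt_len()
        self.client.post(
            "/v1/completions",
            json={
                "model": "mistralai/Mistral-7B-Instruct-v0.3",  # placeholder id
                "prompt": prompt,
                "max_tokens": 256,  # output cap from the setup above
            },
        )
```

Run with `locust -f loadtest.py --users 50` to match the 50-concurrent-user setting.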

Mistral 7B Instruct, FP16

The reference deployment. ~14 GB weights, fits any 16+ GB card.

| GPU | VRAM | Aggregate tok/s | Single-stream tok/s | Median TTFT | p99 TTFT |
|---|---|---|---|---|---|
| RTX 3050 6 GB | 6 GB | does not fit | — | — | — |
| RTX 4060 8 GB | 8 GB | does not fit | — | — | — |
| RTX 3060 12 GB | 12 GB | does not fit | — | — | — |
| RTX 5060 Ti 16 GB | 16 GB | 580 | 62 | 180 ms | 420 ms |
| RTX 5080 | 16 GB | 820 | 95 | 120 ms | 280 ms |
| RTX 3090 | 24 GB | 720 | 58 | 220 ms | 540 ms |
| RTX 4090 | 24 GB | 950 | 82 | 160 ms | 360 ms |
| RTX 5090 | 32 GB | 1,180 | 92 | 130 ms | 300 ms |
| RTX 6000 Pro 96 GB | 96 GB | 1,140 | 88 | 140 ms | 320 ms |
| A100 80 GB | 80 GB | 1,310 | 78 | 170 ms | 390 ms |

Mistral 7B Instruct, FP8 (Blackwell native)

Hardware FP8 on the Blackwell cards (5060 Ti, 5080, 5090, 6000 Pro) lands roughly 1.5–1.65× the FP16 aggregate throughput. Quality regression is <0.5% on standard evals.

| GPU | Aggregate tok/s | Single-stream tok/s | vs FP16 |
|---|---|---|---|
| RTX 5060 Ti 16 GB | 880 | 76 | +52% |
| RTX 5080 | 1,290 | 120 | +57% |
| RTX 5090 | 1,920 | 118 | +63% |
| RTX 6000 Pro 96 GB | 1,860 | 110 | +63% |
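
If you're reproducing this, a minimal sketch of enabling FP8 in vLLM looks like the block below. The model id is a placeholder, and the FP8 KV-cache line is optional, not necessarily what we ran:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # placeholder id
    quantization="fp8",        # on-the-fly FP8 weight quantization
    kv_cache_dtype="fp8",      # optional: stretches the KV pool
    gpu_memory_utilization=0.90,
)

out = llm.generate(
    ["Explain continuous batching in one paragraph."],
    SamplingParams(max_tokens=256),
)
print(out[0].outputs[0].text)
```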

Mistral 7B Instruct, AWQ-INT4

AWQ-INT4 brings the model to ~4.5 GB. Useful for 8-12 GB cards or for stacking multiple models on one card.

| GPU | Aggregate tok/s | Notes |
|---|---|---|
| RTX 3050 6 GB | 180 | Tight; ~2-3K context max |
| RTX 4060 8 GB | 280 | Comfortable for INT4 |
| RTX 3060 12 GB | 310 | 12 GB lets you run longer context |
| RTX 5060 Ti 16 GB | 540 | Best entry tier |
| RTX 5090 | 2,100 | Best aggregate, with KV pool to spare |
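
The "stacking" trick is mostly a memory-budget knob. A rough sketch, assuming a public AWQ conversion of the weights (substitute whichever build you trust):

```python
from vllm import LLM

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # one public AWQ build; assumption
    quantization="awq",
    gpu_memory_utilization=0.45,  # pin this engine to under half the card
    max_model_len=8192,           # shorter context keeps the KV pool small
)
```

With ~4.5 GB of weights plus a capped KV pool, the rest of the card's VRAM stays free for a second engine.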

Mistral Small 22B

22B parameters, ~44 GB at FP16, 22 GB at FP8, ~12 GB at AWQ-INT4. Mistral Small uses a 32K context window natively.
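
Those weight footprints are just parameter count × bytes per parameter; a back-of-envelope helper, where the ~0.55 bytes/param for AWQ-INT4 is our rough allowance for 4-bit weights plus scales:

```python
# Approximate bytes per parameter at each precision; AWQ-INT4 carries
# per-group scales/zeros on top of the 4-bit weights.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "AWQ-INT4": 0.55}

def weight_gb(params_billions: float, precision: str) -> float:
    """Weights only; KV cache and activations come on top."""
    return params_billions * BYTES_PER_PARAM[precision]

for model, size in [("Mistral 7B", 7.2), ("Mistral Small 22B", 22.0)]:
    for prec in BYTES_PER_PARAM:
        print(f"{model} @ {prec}: ~{weight_gb(size, prec):.0f} GB")
```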

| GPU | Precision | Aggregate tok/s | Notes |
|---|---|---|---|
| RTX 3090 | AWQ-INT4 | 280 | Comfortable fit |
| RTX 4090 | AWQ-INT4 | 340 | Comfortable fit |
| RTX 5090 | AWQ-INT4 | 540 | Best single-card cost per token |
| RTX 5090 | FP8 | — | Tight: 22 GB weights + KV cache; 32 GB just barely fits |
| RTX 6000 Pro | FP8 | 680 | Comfortable, recommended |
| RTX 6000 Pro | FP16 | 410 | Reference quality |

Cost per 1M tokens

Calculated as monthly_price_GBP / (aggregate_tok/s × 0.60 × 60 × 60 × 24 × 30) × 1,000,000, where the 0.60 factor is the assumed steady-state utilisation.
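
The same formula as code, for plugging in your own price and measured throughput (the example numbers are hypothetical, not a row from the table):

```python
def cost_per_million_tokens(monthly_gbp: float, agg_tok_s: float,
                            util: float = 0.60) -> float:
    """GBP per 1M generated tokens at a given steady-state utilisation."""
    tokens_per_month = agg_tok_s * util * 60 * 60 * 24 * 30
    return monthly_gbp / tokens_per_month * 1_000_000

print(f"£{cost_per_million_tokens(250.0, 1_500.0):.2f} per 1M tokens")  # ≈ £0.11
```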

| GPU | Monthly cost | Aggregate tok/s (Mistral 7B FP8) | Cost per 1M tokens |
|---|---|---|---|
| RTX 3050 6 GB | £79 | INT4 only | ~£0.41 |
| RTX 5060 Ti 16 GB | £119 | 880 | £0.12 |
| RTX 5080 | £189 | 1,290 | £0.11 |
| RTX 3090 | £159 | 720 (FP16 only) | £0.16 |
| RTX 5090 | £399 | 1,920 | £0.12 |
| RTX 6000 Pro | £899 | 1,860 | £0.38 |

Lower is better. The 5080 is the cost leader at low-to-medium concurrency; the 5090 wins on absolute throughput.

Verdict — which card is the best Mistral host?

If your traffic profile is steady and high-concurrency, the RTX 5090 is the best Mistral 7B host we have. If you need the lowest single-stream latency for a chatbot that feels instant, the RTX 5080 wins. For Mistral Small 22B, the RTX 6000 Pro is the right home if you have the budget; otherwise a single RTX 5090 at AWQ-INT4 is the cheapest practical deployment.

Bottom line

For most teams choosing a Mistral host today: RTX 5090 + FP8. Lowest cost-per-token, plenty of VRAM headroom for KV cache and second models, mature stack. Drop to RTX 3090 if cost is the only consideration; step up to RTX 6000 Pro if you need ECC or you’re running Mistral Small at FP16.

For full GPU-by-GPU sizing across other models, see best GPU for LLM inference.

