
Mistral 7B and Mistral Small 22B Benchmarks Across Every GPU We Host

Real tokens-per-second, time-to-first-token and cost-per-million-tokens numbers for Mistral 7B Instruct and Mistral Small 22B on every GPU in the GigaGPU catalogue, at FP16, FP8 and AWQ-INT4.

Mistral 7B is the most-deployed open-weight model in our customer base: it fits everything from an RTX 3050 (quantised) to an RTX 6000 Pro, supports function calling, and ships under Apache 2.0. Mistral Small 22B is the bigger sibling that's quietly become the default for teams that need more reasoning headroom without jumping to 70B-class hardware. This page is the consolidated benchmark table we use to size deployments.

TL;DR

For Mistral 7B at FP16, the RTX 5090 tops the chart at ~1,180 tok/s aggregate; on cost per million tokens the RTX 5080 leads, and it also wins on TTFT for latency-bound single-stream workloads. Mistral Small 22B needs ~44 GB at FP16 (RTX 6000 Pro only); a 24 GB RTX 3090 or RTX 4090 works at AWQ-INT4, and the RTX 5090 (tight) or RTX 6000 Pro handles FP8.

Benchmark setup

  • vLLM 0.6.3 with continuous batching enabled
  • Locust 2.x driver with 50 concurrent users
  • Prompt distribution: 30% 200-token, 50% 1,000-token, 20% 4,000-token (load shape sketched below)
  • Output capped at 256 tokens
  • 10-minute warm runs, results from the steady-state window
  • Ubuntu 22.04, NVIDIA driver 555.x for Blackwell, 535.x for Ampere/Ada
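
A minimal sketch of that load shape, assuming vLLM's OpenAI-compatible /v1/completions endpoint; the model id, host and the repeated-word prompt bodies are placeholders, not our exact harness:

```python
import random
from locust import HttpUser, task

# (approx prompt tokens, weight) per the 30/50/20 mix above
PROMPT_MIX = [(200, 0.30), (1_000, 0.50), (4_000, 0.20)]

def pick_prompt_len() -> int:
    lengths, weights = zip(*PROMPT_MIX)
    return random.choices(lengths, weights=weights, k=1)[0]

class MistralUser(HttpUser):
    host = "http://localhost:8000"  # placeholder vLLM endpoint

    @task
    def complete(self) -> None:
        # Crude stand-in prompt: roughly one token per repeated word.
        prompt = "benchmark " * pick_prompt_len()
        self.client.post(
            "/v1/completions",
            json={
                "model": "mistralai/Mistral-7B-Instruct-v0.3",  # placeholder id
                "prompt": prompt,
                "max_tokens": 256,  # output cap from the setup above
            },
        )
```

Run with `locust -f loadtest.py --users 50` to match the 50-concurrent-user setting.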

Mistral 7B Instruct, FP16

The reference deployment. ~14 GB weights, fits any 16+ GB card.

| GPU | VRAM | Aggregate tok/s | Single-stream tok/s | Median TTFT | p99 TTFT |
|---|---|---|---|---|---|
| RTX 3050 6 GB | 6 GB | does not fit | — | — | — |
| RTX 4060 8 GB | 8 GB | does not fit | — | — | — |
| RTX 3060 12 GB | 12 GB | does not fit | — | — | — |
| RTX 5060 Ti 16 GB | 16 GB | 580 | 62 | 180 ms | 420 ms |
| RTX 5080 | 16 GB | 820 | 95 | 120 ms | 280 ms |
| RTX 3090 | 24 GB | 720 | 58 | 220 ms | 540 ms |
| RTX 4090 | 24 GB | 950 | 82 | 160 ms | 360 ms |
| RTX 5090 | 32 GB | 1,180 | 92 | 130 ms | 300 ms |
| RTX 6000 Pro 96 GB | 96 GB | 1,140 | 88 | 140 ms | 320 ms |
| A100 80 GB | 80 GB | 1,310 | 78 | 170 ms | 390 ms |

Mistral 7B Instruct, FP8 (Blackwell native)

Hardware FP8 on the Blackwell cards (5060 Ti, 5080, 5090, 6000 Pro) lands roughly 1.5–1.65× the FP16 aggregate throughput. Quality regression is <0.5% on standard evals.

| GPU | Aggregate tok/s | Single-stream tok/s | vs FP16 |
|---|---|---|---|
| RTX 5060 Ti 16 GB | 880 | 76 | +52% |
| RTX 5080 | 1,290 | 120 | +57% |
| RTX 5090 | 1,920 | 118 | +63% |
| RTX 6000 Pro 96 GB | 1,860 | 110 | +63% |
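
If you're reproducing this, a minimal sketch of enabling FP8 in vLLM looks like the block below. The model id is a placeholder, and the FP8 KV-cache line is optional, not necessarily what we ran:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # placeholder id
    quantization="fp8",        # on-the-fly FP8 weight quantization
    kv_cache_dtype="fp8",      # optional: stretches the KV pool
    gpu_memory_utilization=0.90,
)

out = llm.generate(
    ["Explain continuous batching in one paragraph."],
    SamplingParams(max_tokens=256),
)
print(out[0].outputs[0].text)
```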

Mistral 7B Instruct, AWQ-INT4

AWQ-INT4 brings the model to ~4.5 GB. Useful for 8-12 GB cards or for stacking multiple models on one card.

| GPU | Aggregate tok/s | Notes |
|---|---|---|
| RTX 3050 6 GB | 180 | Tight; ~2-3K context max |
| RTX 4060 8 GB | 280 | Comfortable for INT4 |
| RTX 3060 12 GB | 310 | 12 GB lets you run longer context |
| RTX 5060 Ti 16 GB | 540 | Best entry tier |
| RTX 5090 | 2,100 | Best aggregate, with KV pool to spare |
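
The "stacking" trick is mostly a memory-budget knob. A rough sketch, assuming a public AWQ conversion of the weights (substitute whichever build you trust):

```python
from vllm import LLM

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # one public AWQ build; assumption
    quantization="awq",
    gpu_memory_utilization=0.45,  # pin this engine to under half the card
    max_model_len=8192,           # shorter context keeps the KV pool small
)
```

With ~4.5 GB of weights plus a capped KV pool, the rest of the card's VRAM stays free for a second engine.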

Mistral Small 22B

22B parameters, ~44 GB at FP16, 22 GB at FP8, ~12 GB at AWQ-INT4. Mistral Small uses a 32K context window natively.
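
Those weight footprints are just parameter count × bytes per parameter; a back-of-envelope helper, where the ~0.55 bytes/param for AWQ-INT4 is our rough allowance for 4-bit weights plus scales:

```python
# Approximate bytes per parameter at each precision; AWQ-INT4 carries
# per-group scales/zeros on top of the 4-bit weights.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "AWQ-INT4": 0.55}

def weight_gb(params_billions: float, precision: str) -> float:
    """Weights only; KV cache and activations come on top."""
    return params_billions * BYTES_PER_PARAM[precision]

for model, size in [("Mistral 7B", 7.2), ("Mistral Small 22B", 22.0)]:
    for prec in BYTES_PER_PARAM:
        print(f"{model} @ {prec}: ~{weight_gb(size, prec):.0f} GB")
```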

| GPU | Precision | Aggregate tok/s | Notes |
|---|---|---|---|
| RTX 3090 | AWQ-INT4 | 280 | Comfortable fit |
| RTX 4090 | AWQ-INT4 | 340 | Comfortable fit |
| RTX 5090 | AWQ-INT4 | 540 | Best single-card cost per token |
| RTX 5090 | FP8 | — | Tight: 22 GB weights + KV cache; 32 GB just barely fits |
| RTX 6000 Pro | FP8 | 680 | Comfortable, recommended |
| RTX 6000 Pro | FP16 | 410 | Reference quality |

Cost per 1M tokens

Calculated as monthly_price_GBP / (aggregate_tok/s × 0.60 × 60 × 60 × 24 × 30) × 1,000,000, where the 0.60 factor is the assumed steady-state utilisation.
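
The same formula as code, for plugging in your own price and measured throughput (the example numbers are hypothetical, not a row from the table):

```python
def cost_per_million_tokens(monthly_gbp: float, agg_tok_s: float,
                            util: float = 0.60) -> float:
    """GBP per 1M generated tokens at a given steady-state utilisation."""
    tokens_per_month = agg_tok_s * util * 60 * 60 * 24 * 30
    return monthly_gbp / tokens_per_month * 1_000_000

print(f"£{cost_per_million_tokens(250.0, 1_500.0):.2f} per 1M tokens")  # ≈ £0.11
```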

| GPU | Monthly cost | Aggregate tok/s (Mistral 7B FP8) | Cost per 1M tokens |
|---|---|---|---|
| RTX 3050 6 GB | £79 | INT4 only | ~£0.41 |
| RTX 5060 Ti 16 GB | £119 | 880 | £0.12 |
| RTX 5080 | £189 | 1,290 | £0.11 |
| RTX 3090 | £159 | 720 (FP16 only) | £0.16 |
| RTX 5090 | £399 | 1,920 | £0.12 |
| RTX 6000 Pro | £899 | 1,860 | £0.38 |

Lower is better. The 5080 is the cost leader at low-to-medium concurrency; the 5090 wins on absolute throughput.

Verdict — which card is the best Mistral host?

If your traffic profile is steady and high-concurrency, the RTX 5090 is the best Mistral 7B host we have. If you need the lowest single-stream latency for a chatbot that feels instant, the RTX 5080 wins. For Mistral Small 22B, the RTX 6000 Pro is the right home if you have the budget; otherwise a single RTX 5090 at AWQ-INT4 is the cheapest practical deployment.

Bottom line

For most teams choosing a Mistral host today: RTX 5090 + FP8. Lowest cost-per-token, plenty of VRAM headroom for KV cache and second models, mature stack. Drop to RTX 3090 if cost is the only consideration; step up to RTX 6000 Pro if you need ECC or you’re running Mistral Small at FP16.

For full GPU-by-GPU sizing across other models, see best GPU for LLM inference.

