
RTX 5060 Ti 16GB Mistral 7B Benchmark

Mistral 7B v0.3 on Blackwell 16GB - measured decode, prefill, and concurrency numbers across FP8, AWQ, and GGUF.

Mistral 7B v0.3 remains popular thanks to its permissive Apache 2.0 licence and strong instruction-following. Here are the numbers for the RTX 5060 Ti 16GB on our dedicated GPU hosting.

Setup

  • vLLM 0.6.4, CUDA 12.6, FlashAttention 2.6
  • Model: mistralai/Mistral-7B-Instruct-v0.3
  • Context 32k, GQA 8 KV heads, 32 layers
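For reference, a minimal vLLM launch matching the FP8 + FP8 KV configuration tested below might look like this (flags assume vLLM 0.6.x defaults; the memory-utilization value is a typical starting point, not part of our test config):

```shell
# Serve Mistral 7B v0.3 with FP8 weights and FP8 KV cache on vLLM 0.6.x
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92
```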

Decode Throughput

128 in / 512 out, batch 1:

| Precision | VRAM | t/s |
|---|---|---|
| FP16 | 14.2 GB | 58 (tight, minimal KV) |
| FP8 E4M3 | 7.2 GB | 118 |
| FP8 + FP8 KV | 6.9 GB | 122 |
| AWQ INT4 | 4.8 GB | 142 |
| GGUF Q4_K_M | 4.3 GB | 102 |
| EXL2 4.0 bpw | 4.2 GB | 152 |

Mistral 7B slightly edges out Llama 3 8B at the same precision because it is roughly a billion parameters smaller. FP16 only just fits and is not recommended at this VRAM; prefer FP8.
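The reason FP16 is so tight is the KV cache. With 32 layers and 8 GQA KV heads (and assuming Mistral's standard head dimension of 128), a full 32k context costs about 4 GiB at FP16 and half that at FP8, on top of 14.2 GB of weights:

```python
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """KV cache size: K and V tensors per layer, per token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

print(kv_cache_bytes(32768) / 2**30)                 # FP16 KV: 4.0 GiB
print(kv_cache_bytes(32768, dtype_bytes=1) / 2**30)  # FP8 KV:  2.0 GiB
```

At FP16 weights plus FP16 KV, a single 32k-context request alone would exceed 16 GB, which is why the FP16 row above only works with a minimal KV allocation.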

Prefill Throughput

  • FP8: 7,200 input t/s
  • AWQ INT4: 4,500 input t/s
  • GGUF Q4: 3,400 input t/s
  • EXL2 4.0 bpw: 5,500 input t/s
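These prefill rates translate directly into time-to-first-token for a given prompt length. A rough back-of-envelope model (ours, ignoring scheduling and tokenisation overhead): prefill the whole prompt, then emit one token.

```python
def ttft_estimate(prompt_tokens: int, prefill_tps: float, decode_tps: float) -> float:
    """Approximate TTFT in seconds: prompt prefill plus one decode step."""
    return prompt_tokens / prefill_tps + 1 / decode_tps

# 2,048-token prompt at the FP8 rates measured above (7,200 prefill, 118 decode)
print(f"{ttft_estimate(2048, 7200, 118) * 1000:.0f} ms")  # ~293 ms
```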

Concurrency Scaling

| Users | Total t/s (FP8 + FP8 KV) | Per user | p99 TTFT |
|---|---|---|---|
| 1 | 122 | 122 | 170 ms |
| 4 | 385 | 96 | 290 ms |
| 8 | 545 | 68 | 460 ms |
| 16 | 680 | 43 | 750 ms |
| 32 | 770 | 24 | 1,400 ms |
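To see how far continuous batching is from linear scaling, divide each aggregate figure by the user count (a quick sketch over the measured numbers):

```python
measured = {1: 122, 4: 385, 8: 545, 16: 680, 32: 770}  # users -> aggregate t/s

for users, total in measured.items():
    per_user = total / users
    scaling = total / (users * measured[1])  # fraction of perfect linear scaling
    print(f"{users:>2} users: {per_user:5.1f} t/s each ({scaling:.0%} of linear)")
```

At 32 users the GPU delivers about 6x the single-user aggregate, but each user sees roughly a fifth of the solo decode rate.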

vs Llama 3 8B

| Metric | Mistral 7B | Llama 3 8B |
|---|---|---|
| FP8 decode (batch 1) | 122 t/s | 112 t/s |
| FP8 max aggregate | 770 t/s | 720 t/s |
| VRAM at FP8 | 7.2 GB | 8.0 GB |
| MMLU (published) | 60.8 | 68.4 |
| HumanEval (published) | 30.5 | 62.2 |

Mistral 7B wins on raw speed; Llama 3 8B wins on quality. For general chat and content generation Mistral 7B is still competitive. For code or reasoning, Llama 3 8B is meaningfully better.

Mistral 7B on Blackwell 16GB

122 t/s FP8, Apache 2.0 licensed. UK dedicated hosting.

Order the RTX 5060 Ti 16GB

See also: Llama 3 8B benchmark, monthly cost, FP8 deployment, AWQ guide, EXL2 guide.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
