Mistral 7B v0.3 remains popular thanks to its permissive Apache 2.0 licence and strong instruction-following. Here are the numbers for the RTX 5060 Ti 16GB on our dedicated GPU hosting.
Setup
- vLLM 0.6.4, CUDA 12.6, FlashAttention 2.6
- Model: mistralai/Mistral-7B-Instruct-v0.3
- Context 32k, GQA 8 KV heads, 32 layers
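For orientation, here is a minimal sketch of an offline vLLM engine matching this setup. The model ID and context length come from the list above; `quantization="fp8"` and `kv_cache_dtype="fp8"` are standard vLLM engine arguments, but behaviour varies between releases, so treat this as illustrative rather than the exact benchmark harness.

```python
from vllm import LLM, SamplingParams

# Sketch: offline vLLM engine roughly matching the FP8 + FP8 KV row below.
# Flags are standard vLLM engine args; exact behaviour varies by release.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    quantization="fp8",           # FP8 (E4M3) weights/activations
    kv_cache_dtype="fp8",         # FP8 KV cache
    max_model_len=32768,          # full 32k context
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
print(llm.generate(["Explain GQA in one paragraph."], params)[0].outputs[0].text)
```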
Decode Throughput
128 in / 512 out, batch 1:
| Precision | VRAM | t/s |
|---|---|---|
| FP16 | 14.2 GB | 58 (tight, minimal KV) |
| FP8 E4M3 | 7.2 GB | 118 |
| FP8 + FP8 KV | 6.9 GB | 122 |
| AWQ INT4 | 4.8 GB | 142 |
| GGUF Q4_K_M | 4.3 GB | 102 |
| EXL2 4.0 bpw | 4.2 GB | 152 |
Mistral 7B slightly edges out Llama 3 8B at the same precision because it is about a billion parameters smaller (7.2B vs 8.0B). FP16 just fits but leaves almost no headroom for KV cache; on 16GB, prefer FP8.
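For reproducibility, here is a hedged sketch of how a single 128 in / 512 out decode figure can be taken with the vLLM offline API. The prompt is a crude placeholder, and the elapsed time includes prefill (a small overstatement of decode time at 128 input tokens); this is not the exact harness used for the table above.

```python
import time
from vllm import LLM, SamplingParams

# Sketch: time one 128-in / 512-out request and derive decode t/s.
# ignore_eos forces the full 512 tokens so figures are comparable
# across precisions.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3", quantization="fp8")
params = SamplingParams(max_tokens=512, ignore_eos=True)
prompt = "word " * 128  # crude ~128-token prompt; a real run counts tokens

start = time.perf_counter()
out = llm.generate([prompt], params)[0]
elapsed = time.perf_counter() - start  # includes prefill (small at 128 in)

n_out = len(out.outputs[0].token_ids)
print(f"{n_out} tokens in {elapsed:.2f}s -> {n_out / elapsed:.0f} t/s")
```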
Prefill Throughput
- FP8: 7,200 input t/s
- AWQ INT4: 4,500 input t/s
- GGUF Q4: 3,400 input t/s
- EXL2 4.0 bpw: 5,500 input t/s
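Figures like these can be approximated by timing a long prompt with a single output token, so almost all of the elapsed time is prompt processing. A minimal sketch, again using the vLLM offline API with a placeholder prompt:

```python
import time
from vllm import LLM, SamplingParams

# Sketch: estimate prefill throughput by generating one token over a long
# prompt, so nearly all elapsed time is prompt processing.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3", quantization="fp8")
prompt = "word " * 4096               # crude ~4k-token prompt
params = SamplingParams(max_tokens=1)

start = time.perf_counter()
out = llm.generate([prompt], params)[0]
elapsed = time.perf_counter() - start

n_in = len(out.prompt_token_ids)
print(f"{n_in} prompt tokens in {elapsed:.2f}s -> {n_in / elapsed:.0f} input t/s")
```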
Concurrency Scaling
| Users | Total t/s (FP8+FP8 KV) | Per user | p99 TTFT |
|---|---|---|---|
| 1 | 122 | 122 | 170 ms |
| 4 | 385 | 96 | 290 ms |
| 8 | 545 | 68 | 460 ms |
| 16 | 680 | 43 | 750 ms |
| 32 | 770 | 24 | 1,400 ms |
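The p99 TTFT column can be approximated against vLLM's OpenAI-compatible server with a small async client that times the first streamed chunk of each request. The localhost URL and fixed user count below are assumptions for illustration, not our load-test rig:

```python
import asyncio
import time
from openai import AsyncOpenAI

# Sketch: p99 TTFT under concurrency against a vLLM OpenAI-compatible
# server. The localhost URL and user count are illustrative assumptions.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def ttft() -> float:
    start = time.perf_counter()
    stream = await client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.3",
        messages=[{"role": "user", "content": "Summarise GQA briefly."}],
        max_tokens=512,
        stream=True,
    )
    async for _ in stream:              # first streamed chunk marks TTFT
        return time.perf_counter() - start
    return time.perf_counter() - start  # fallback if the stream is empty

async def main(users: int = 16) -> None:
    times = sorted(await asyncio.gather(*(ttft() for _ in range(users))))
    print(f"p99 TTFT: {times[int(0.99 * len(times))] * 1000:.0f} ms")

asyncio.run(main())
```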
vs Llama 3 8B
| Metric | Mistral 7B | Llama 3 8B |
|---|---|---|
| FP8 decode (batch 1) | 122 t/s | 112 t/s |
| FP8 max aggregate | 770 t/s | 720 t/s |
| VRAM at FP8 | 7.2 GB | 8.0 GB |
| MMLU (published) | 60.8 | 68.4 |
| HumanEval (published) | 30.5 | 62.2 |
Mistral 7B wins on raw speed; Llama 3 8B wins on quality. For general chat and content generation, Mistral 7B is still competitive; for code or reasoning, Llama 3 8B is meaningfully better.
Mistral 7B on Blackwell 16GB
122 t/s at FP8, Apache 2.0 licensed. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: Llama 3 8B benchmark, monthly cost, FP8 deployment, AWQ guide, EXL2 guide.