
Mistral Large Tokens/sec by GPU

Benchmark data for Mistral Large inference speed across GPUs with quantisation comparisons and cost-per-token analysis for UK dedicated GPU hosting.

Mistral Large Benchmark Overview

Mistral Large is Mistral AI’s flagship dense model with 123 billion parameters, designed for complex reasoning, multilingual tasks, and code generation. At this scale, running it requires substantial GPU resources, making it a model best suited for high-end dedicated GPU servers. We benchmark inference speed to help you plan your deployment.

Testing used vLLM on GigaGPU servers with a 512-token input and a 256-token output. Mistral Large at FP16 requires approximately 246 GB of VRAM, and even INT4 needs roughly 62 GB, so multi-GPU configurations are mandatory. See our tokens per second benchmark hub for methodology.
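Those VRAM figures follow directly from the parameter count: weight memory scales linearly with precision. A minimal sketch (weights only; KV cache, activations, and framework overhead all add to the real requirement):

```python
# Rough weight-memory estimate for a dense LLM: parameters x bytes/parameter.
# Ignores KV cache, activations, and framework overhead, so real deployments
# need headroom beyond these numbers.
def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * (bits_per_param / 8) / 1e9

print(weight_vram_gb(123, 16))  # FP16: 246.0 GB
print(weight_vram_gb(123, 4))   # INT4: 61.5 GB (~62 GB in practice)
```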

Tokens/sec Results by GPU

No single consumer GPU can run Mistral Large. The table below shows multi-GPU configurations with INT4 quantisation.

| Configuration | Total VRAM | Mistral Large INT4 (tok/s) | Notes |
|---|---|---|---|
| Single RTX 3090 | 24 GB | N/A | Insufficient VRAM |
| Single RTX 5090 | 32 GB | N/A | Insufficient VRAM |
| 2x RTX 5090 | 64 GB | 5 tok/s | Tight fit with offloading |
| 4x RTX 3090 | 96 GB | 6 tok/s | INT4 fits across 4 GPUs |
| 4x RTX 5090 | 128 GB | 14 tok/s | Comfortable with headroom |

Mistral Large is a heavyweight that demands 4-GPU configurations for practical use. The RTX 5090 quad setup at 14 tok/s is the minimum for interactive applications.
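To put those throughput figures in latency terms, the time to produce the benchmark's 256-token output follows directly. A back-of-envelope sketch (decode time only; real latency also includes prefill of the 512-token prompt):

```python
# Decode time for the benchmark's 256-token output at the measured rates.
# Prefill time for the 512-token prompt is ignored here.
output_tokens = 256
for config, tok_per_s in [("4x RTX 3090 INT4", 6), ("4x RTX 5090 INT4", 14)]:
    print(f"{config}: {output_tokens / tok_per_s:.1f} s per response")
# 4x RTX 3090: ~42.7 s; 4x RTX 5090: ~18.3 s
```

At roughly 43 seconds per response, the 4x RTX 3090 setup is clearly batch-only territory, which is why we call 14 tok/s the interactive minimum.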

Quantisation Impact on Speed

Given the model’s size, only INT4 is practical on consumer hardware. Below we compare the INT4 results with INT8 where VRAM permits. For quantisation analysis, see our FP16 vs INT8 vs INT4 comparison.
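Which precision fits which rig is simple arithmetic on weight size versus VRAM budget. A sketch (weights only; a real deployment needs extra room for KV cache and activations, so a marginal fit may still fail):

```python
# Which precisions fit which multi-GPU VRAM budgets (weights only; real
# deployments need extra headroom for KV cache and activations).
PARAMS_B = 123  # Mistral Large parameter count, in billions

def fits(bits_per_param: int, vram_gb: int) -> bool:
    """True if the quantised weights alone fit in the given VRAM budget."""
    return PARAMS_B * bits_per_param / 8 <= vram_gb

print(fits(8, 96))    # INT8 on 4x RTX 3090: False (~123 GB > 96 GB)
print(fits(8, 128))   # INT8 on 4x RTX 5090: True (~123 GB < 128 GB)
print(fits(4, 96))    # INT4 on 4x RTX 3090: True (~62 GB < 96 GB)
```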

| Configuration | INT8 (tok/s) | INT4 (tok/s) |
|---|---|---|
| 4x RTX 3090 (96 GB) | N/A (needs ~123 GB) | 6 tok/s |
| 4x RTX 5090 (128 GB) | 10 tok/s | 14 tok/s |

INT4 is approximately 40% faster than INT8 on the 4x RTX 5090 setup. For Mistral Large, INT4 is the recommended precision unless your use case demands maximum accuracy.
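The 40% figure falls straight out of the table:

```python
# INT8 -> INT4 speedup on the 4x RTX 5090 configuration, from the table above.
int8_tps = 10
int4_tps = 14
speedup_pct = (int4_tps / int8_tps - 1) * 100
print(f"INT4 is ~{speedup_pct:.0f}% faster than INT8")
```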

Cost Efficiency Analysis

| Configuration | INT4 tok/s | Approx. Monthly Cost | tok/s per £/month |
|---|---|---|---|
| 4x RTX 3090 | 6 | ~£400 | 0.015 |
| 4x RTX 5090 | 14 | ~£920 | 0.015 |

Both configurations offer essentially the same cost efficiency, so the choice comes down to whether 6 tok/s or 14 tok/s meets your latency requirements. If budget is a primary concern, see our best GPU for Mistral guide and consider the smaller Mistral models instead.
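The tok/s-per-pound column can be extended to a rough cost per million generated tokens. A sketch assuming the table's prices and, optimistically, 24/7 full utilisation (real utilisation will be lower, so real cost per token will be higher):

```python
# Cost efficiency from the table: tok/s per pound, plus an optimistic
# cost per million output tokens assuming 24/7 full utilisation.
configs = {"4x RTX 3090": (6, 400), "4x RTX 5090": (14, 920)}
for name, (tok_per_s, monthly_gbp) in configs.items():
    tps_per_pound = tok_per_s / monthly_gbp
    monthly_tokens_m = tok_per_s * 3600 * 24 * 30 / 1e6  # millions per month
    cost_per_m = monthly_gbp / monthly_tokens_m
    print(f"{name}: {tps_per_pound:.3f} tok/s per £, ~£{cost_per_m:.2f}/M tokens")
```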

GPU Recommendations

  • Minimum viable: 4x RTX 3090 — 6 tok/s at INT4 for batch processing and offline tasks.
  • Recommended: 4x RTX 5090 — 14 tok/s for moderate-traffic interactive applications.
  • Alternative: Consider Mixtral 8x7B for a much lighter option from the Mistral family, or Qwen 2.5 72B for a similarly capable but smaller model.

For a small-model baseline, check the Qwen 2.5 7B benchmark as a size comparison reference. Browse all data in the Benchmarks category.

Conclusion

Mistral Large is best suited for teams with the budget for multi-GPU servers and a genuine need for top-tier model quality. At 123B parameters, it pushes even 4-GPU consumer setups to their limits but delivers exceptional reasoning and multilingual capabilities in return.

Enterprise GPU Servers for Mistral Large

Multi-GPU dedicated servers with up to 4x GPUs, full root access, and UK-based hosting for lowest latency.

Browse GPU Servers


