
Mistral Large Tokens/sec by GPU

Benchmark data for Mistral Large inference speed across GPUs with quantisation comparisons and cost-per-token analysis for UK dedicated GPU hosting.

Mistral Large Benchmark Overview

Mistral Large is Mistral AI’s flagship dense model with 123 billion parameters, designed for complex reasoning, multilingual tasks, and code generation. At this scale, running it requires substantial GPU resources, making it a model best suited for high-end dedicated GPU servers. We benchmark inference speed to help you plan your deployment.

Testing used vLLM on GigaGPU servers with a 512-token input and a 256-token output. Mistral Large at FP16 requires approximately 246 GB of VRAM, and even INT4 needs roughly 62 GB, so multi-GPU configurations are mandatory. See our tokens per second benchmark hub for methodology.
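Those VRAM figures follow directly from the parameter count: weight memory scales linearly with precision. A minimal sketch (weights only; KV cache, activations, and framework overhead all add to the real requirement):

```python
# Rough weight-memory estimate for a dense LLM: parameters x bytes/parameter.
# Ignores KV cache, activations, and framework overhead, so real deployments
# need headroom beyond these numbers.
def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * (bits_per_param / 8) / 1e9

print(weight_vram_gb(123, 16))  # FP16: 246.0 GB
print(weight_vram_gb(123, 4))   # INT4: 61.5 GB (~62 GB in practice)
```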

Tokens/sec Results by GPU

No single consumer GPU can run Mistral Large. The table below shows multi-GPU configurations with INT4 quantisation.

| Configuration | Total VRAM | Mistral Large INT4 (tok/s) | Notes |
|---|---|---|---|
| Single RTX 3090 | 24 GB | N/A | Insufficient VRAM |
| Single RTX 5090 | 32 GB | N/A | Insufficient VRAM |
| 2x RTX 5090 | 64 GB | 5 tok/s | Tight fit with offloading |
| 4x RTX 3090 | 96 GB | 6 tok/s | INT4 fits across 4 GPUs |
| 4x RTX 5090 | 128 GB | 14 tok/s | Comfortable with headroom |

Mistral Large is a heavyweight that demands 4-GPU configurations for practical use. The RTX 5090 quad setup at 14 tok/s is the minimum for interactive applications.
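To put those throughput figures in latency terms, the time to produce the benchmark's 256-token output follows directly. A back-of-envelope sketch (decode time only; real latency also includes prefill of the 512-token prompt):

```python
# Decode time for the benchmark's 256-token output at the measured rates.
# Prefill time for the 512-token prompt is ignored here.
output_tokens = 256
for config, tok_per_s in [("4x RTX 3090 INT4", 6), ("4x RTX 5090 INT4", 14)]:
    print(f"{config}: {output_tokens / tok_per_s:.1f} s per response")
# 4x RTX 3090: ~42.7 s; 4x RTX 5090: ~18.3 s
```

At roughly 43 seconds per response, the 4x RTX 3090 setup is clearly batch-only territory, which is why we call 14 tok/s the interactive minimum.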

Quantisation Impact on Speed

Given the model’s size, only INT4 is practical on consumer hardware. Below we compare the INT4 results with INT8 where VRAM permits. For quantisation analysis, see our FP16 vs INT8 vs INT4 comparison.
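Which precision fits which rig is simple arithmetic on weight size versus VRAM budget. A sketch (weights only; a real deployment needs extra room for KV cache and activations, so a marginal fit may still fail):

```python
# Which precisions fit which multi-GPU VRAM budgets (weights only; real
# deployments need extra headroom for KV cache and activations).
PARAMS_B = 123  # Mistral Large parameter count, in billions

def fits(bits_per_param: int, vram_gb: int) -> bool:
    """True if the quantised weights alone fit in the given VRAM budget."""
    return PARAMS_B * bits_per_param / 8 <= vram_gb

print(fits(8, 96))    # INT8 on 4x RTX 3090: False (~123 GB > 96 GB)
print(fits(8, 128))   # INT8 on 4x RTX 5090: True (~123 GB < 128 GB)
print(fits(4, 96))    # INT4 on 4x RTX 3090: True (~62 GB < 96 GB)
```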

| Configuration | INT8 (tok/s) | INT4 (tok/s) |
|---|---|---|
| 4x RTX 3090 (96 GB) | N/A (needs ~123 GB) | 6 tok/s |
| 4x RTX 5090 (128 GB) | 10 tok/s | 14 tok/s |

INT4 is approximately 40% faster than INT8 on the 4x RTX 5090 setup. For Mistral Large, INT4 is the recommended precision unless your use case demands maximum accuracy.
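The 40% figure falls straight out of the table:

```python
# INT8 -> INT4 speedup on the 4x RTX 5090 configuration, from the table above.
int8_tps = 10
int4_tps = 14
speedup_pct = (int4_tps / int8_tps - 1) * 100
print(f"INT4 is ~{speedup_pct:.0f}% faster than INT8")
```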

Cost Efficiency Analysis

| Configuration | INT4 tok/s | Approx. Monthly Cost | tok/s per £/month |
|---|---|---|---|
| 4x RTX 3090 | 6 | ~£400 | 0.015 |
| 4x RTX 5090 | 14 | ~£920 | 0.015 |

Both configurations offer essentially the same cost efficiency, so the choice comes down to whether 6 tok/s or 14 tok/s meets your latency requirements. If budget is a primary concern, see our best GPU for Mistral guide and consider the smaller Mistral models instead.
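The tok/s-per-pound column can be extended to a rough cost per million generated tokens. A sketch assuming the table's prices and, optimistically, 24/7 full utilisation (real utilisation will be lower, so real cost per token will be higher):

```python
# Cost efficiency from the table: tok/s per pound, plus an optimistic
# cost per million output tokens assuming 24/7 full utilisation.
configs = {"4x RTX 3090": (6, 400), "4x RTX 5090": (14, 920)}
for name, (tok_per_s, monthly_gbp) in configs.items():
    tps_per_pound = tok_per_s / monthly_gbp
    monthly_tokens_m = tok_per_s * 3600 * 24 * 30 / 1e6  # millions per month
    cost_per_m = monthly_gbp / monthly_tokens_m
    print(f"{name}: {tps_per_pound:.3f} tok/s per £, ~£{cost_per_m:.2f}/M tokens")
```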

GPU Recommendations

  • Minimum viable: 4x RTX 3090 — 6 tok/s at INT4 for batch processing and offline tasks.
  • Recommended: 4x RTX 5090 — 14 tok/s for moderate-traffic interactive applications.
  • Alternative: Consider Mixtral 8x7B for a much lighter option from the Mistral family, or Qwen 2.5 72B for a similarly capable but smaller model.

For a small-model baseline, check the Qwen 2.5 7B benchmark as a size comparison reference. Browse all data in the Benchmarks category.

Conclusion

Mistral Large is best suited for teams with the budget for multi-GPU servers and a genuine need for top-tier model quality. At 123B parameters, it pushes even 4-GPU consumer setups to their limits but delivers exceptional reasoning and multilingual capabilities in return.

Enterprise GPU Servers for Mistral Large

Multi-GPU dedicated servers with up to 4x GPUs, full root access, and UK-based hosting for lowest latency.

Browse GPU Servers


