Mistral VRAM Requirements Overview
Mistral AI offers models ranging from the efficient 7B to the flagship Mistral Large at 123B parameters. The Mixtral line uses a Mixture-of-Experts (MoE) architecture that needs more total VRAM than the active parameter count suggests. This guide covers every Mistral variant to help you pick the right dedicated GPU server for Mistral hosting.
Mistral 7B v0.1 introduced sliding window attention (4,096-token window) and grouped-query attention, making it exceptionally efficient for its size; v0.2 and v0.3 drop the sliding window in favor of full attention over a 32K context. Mixtral 8x7B has 46.7B total parameters but activates only ~12.9B per token, so inference is fast despite the large memory footprint.
Complete VRAM Table (All Models)
| Model | Parameters | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|---|
| Mistral 7B v0.3 | 7.3B | ~29 GB | ~14.5 GB | ~7.5 GB | ~4.5 GB |
| Mistral 7B Instruct | 7.3B | ~29 GB | ~14.5 GB | ~7.5 GB | ~4.5 GB |
| Mistral Nemo 12B | 12.2B | ~49 GB | ~24.5 GB | ~12.5 GB | ~7.5 GB |
| Mixtral 8x7B | 46.7B (MoE) | ~187 GB | ~93 GB | ~47 GB | ~26 GB |
| Mixtral 8x7B Instruct | 46.7B (MoE) | ~187 GB | ~93 GB | ~47 GB | ~26 GB |
| Mixtral 8x22B | 141B (MoE) | ~564 GB | ~282 GB | ~141 GB | ~75 GB |
| Mistral Small (22B) | 22B | ~88 GB | ~44 GB | ~22 GB | ~13 GB |
| Mistral Large (123B) | 123B | ~492 GB | ~246 GB | ~123 GB | ~66 GB |
Note: Mixtral MoE models require VRAM for all experts even though only 2 of 8 are active per token. This means Mixtral 8x7B needs roughly the same VRAM as a dense 47B model despite running at ~13B model speed. For similar models, see our LLaMA 3 VRAM requirements page.
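The figures above follow a simple rule of thumb: weight memory is parameter count times bytes per parameter, with a small margin added for quantization metadata and runtime overhead. A minimal sketch of that estimate (weight memory only, in decimal GB; the margin and KV cache are not modelled here):

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# The table above adds a small margin on top of this for quantization
# metadata and runtime overhead, and none of these figures include KV cache.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billion: float, precision: str) -> float:
    """Approximate weight VRAM in GB (decimal) for a given size and precision."""
    return params_billion * BYTES_PER_PARAM[precision]

# For MoE models, use the TOTAL parameter count: all 46.7B Mixtral weights
# must sit in VRAM even though only ~12.9B are active per token.
print(f"Mistral 7B   @ FP16: {weight_vram_gb(7.3, 'fp16'):.1f} GB")   # ~14.6 GB
print(f"Mixtral 8x7B @ INT4: {weight_vram_gb(46.7, 'int4'):.1f} GB")  # ~23.4 GB
```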
Which GPU Do You Need?
| GPU | VRAM | Best Mistral Model | Precision | Use Case |
|---|---|---|---|---|
| RTX 3050 | 8 GB | Mistral 7B | 4-bit | Dev / testing |
| RTX 4060 | 8 GB | Mistral 7B | 4-bit / Q6_K | Dev / personal |
| RTX 4060 Ti | 16 GB | Mistral 7B / Nemo 12B | FP16 / INT8 | Small production |
| RTX 3090 | 24 GB | Nemo 12B / Mixtral 8x7B | FP16 / 4-bit | Production |
| 2x RTX 3090 | 48 GB | Mixtral 8x7B / Small | INT8 / FP16 | High quality |
| 4x RTX 3090 | 96 GB | Mistral Large | 4-bit | Full capability |
For a specific GPU-model pairing analysis, read our RTX 4060 + Mistral 7B article.
Context Length Impact on VRAM
Mistral models support various context lengths, and KV cache VRAM grows linearly with the number of cached tokens:
| Context | 7B KV Cache | Nemo 12B KV | Mixtral 8x7B KV | Small 22B KV |
|---|---|---|---|---|
| 4,096 | ~0.5 GB | ~0.8 GB | ~1.5 GB | ~1.5 GB |
| 8,192 | ~1 GB | ~1.6 GB | ~3 GB | ~3 GB |
| 16,384 | ~2 GB | ~3.2 GB | ~6 GB | ~6 GB |
| 32,768 | ~4 GB | ~6.4 GB | ~12 GB | ~12 GB |
Mistral 7B v0.1’s sliding window attention (4K window) caps the effective KV cache at 4K tokens regardless of input length, keeping VRAM usage predictable. Later 7B releases (v0.2/v0.3) and newer models like Nemo and Mixtral use full attention with longer contexts, so budget for the full figures above.
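For full-attention models, per-token KV cache is 2 × layers × KV heads × head dimension × bytes per element. A minimal sketch using Mistral 7B’s published architecture (32 layers, 8 KV heads via GQA, head dim 128); plug in the other models’ configs to reproduce the rest of the table:

```python
# Per-token KV cache: 2 (K and V) x layers x KV heads x head_dim x bytes per element.
def kv_cache_gb(context_tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """FP16 KV cache size in GB for one sequence at the given context length."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return context_tokens * per_token / 1024**3

# Mistral 7B: 32 layers, 8 KV heads (GQA), head_dim 128 -> 128 KB per token.
print(f"7B @ 4K ctx:  {kv_cache_gb(4096, 32, 8, 128):.2f} GB")   # ~0.5 GB
print(f"7B @ 32K ctx: {kv_cache_gb(32768, 32, 8, 128):.2f} GB")  # ~4 GB
```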
Batch Size Impact on VRAM
| Model (4-bit, 4K ctx) | Batch 1 | Batch 4 | Batch 8 | Batch 16 |
|---|---|---|---|---|
| Mistral 7B | ~5 GB | ~7 GB | ~9 GB | ~13 GB |
| Nemo 12B | ~8.5 GB | ~12 GB | ~15 GB | ~22 GB |
| Mixtral 8x7B | ~28 GB | ~34 GB | ~40 GB | ~52 GB |
Mistral 7B is exceptionally batch-friendly due to its small KV cache. On a 24 GB GPU, you can serve 16+ concurrent users at 4-bit quantization. This makes it one of the most cost-effective models for production APIs.
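Serving VRAM is roughly weight memory plus one KV-cache allocation per concurrent sequence, which is where the 16-plus-user figure comes from. A back-of-the-envelope check, assuming a 2 GB reserve for activations and fragmentation (the reserve is an assumption, not a measured value):

```python
# Back-of-the-envelope concurrency check for Mistral 7B at 4-bit on a 24 GB GPU.
GPU_VRAM_GB = 24.0
WEIGHTS_GB = 5.0      # Mistral 7B, 4-bit (from the table above)
KV_PER_SEQ_GB = 0.5   # FP16 KV cache per sequence at 4K context
HEADROOM_GB = 2.0     # assumed reserve for activations and fragmentation

max_sequences = int((GPU_VRAM_GB - WEIGHTS_GB - HEADROOM_GB) / KV_PER_SEQ_GB)
print(f"~{max_sequences} concurrent 4K-context sequences")  # ~34 on paper; expect fewer in practice
```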
Practical Deployment Recommendations
- Budget chatbot: Mistral 7B on RTX 4060 (4-bit). 24-28 tok/s, handles a single user.
- Quality chatbot: Mistral 7B on RTX 4060 Ti (FP16). 35 tok/s, 2-3 concurrent users.
- Production API: Mistral 7B on RTX 3090 (FP16 or INT8). 40-55 tok/s, 8+ concurrent users.
- Higher capability: Mixtral 8x7B on 2x RTX 3090 (4-bit). The MoE architecture runs at roughly the speed of a ~13B dense model while delivering quality closer to much larger dense models.
- Maximum quality: Mistral Large on multi-GPU cluster. Enterprise-grade reasoning.
For pricing analysis, see our cost per 1M tokens: GPU vs API comparison and the LLM cost calculator.
Quick Setup Commands
Ollama
```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama run mistral:7b      # 7B
ollama run mixtral:8x7b    # Mixtral 8x7B (needs 26+ GB at 4-bit)
```
vLLM
```bash
# Mistral 7B FP16 on RTX 3090
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --dtype float16 --max-model-len 4096

# Mistral 7B AWQ on RTX 4060
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq --max-model-len 4096
```
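Once vLLM is serving, it exposes an OpenAI-compatible API (port 8000 by default). A minimal Python sketch of a request; the prompt and sampling parameters are placeholders, and the model name must match whatever you passed to vllm serve:

```python
import requests

# Query the vLLM OpenAI-compatible chat endpoint (default port 8000).
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": "Summarise sliding window attention in one sentence."}],
        "max_tokens": 128,
        "temperature": 0.7,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```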
For full deployment guides, see our Ollama hosting and vLLM hosting pages. Compare with similar models on our best GPU for LLM inference page and use the benchmark tool for speed comparisons.
Deploy This Model Now
Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.
Browse GPU Servers