
Mistral VRAM Requirements (7B, 8x7B, Large)

Complete Mistral VRAM requirements for 7B, Mixtral 8x7B, Mistral Small, and Mistral Large. FP32, FP16, INT8, INT4 tables plus GPU recommendations.

Mistral VRAM Requirements Overview

Mistral AI offers models ranging from the efficient 7B to the flagship Mistral Large at 123B parameters. The Mixtral line uses a Mixture-of-Experts (MoE) architecture that needs more total VRAM than the active parameter count suggests. This guide covers every Mistral variant to help you pick the right dedicated GPU server for Mistral hosting.

Mistral 7B introduced sliding window attention (4,096 tokens) and grouped-query attention, making it exceptionally efficient for its size. Mixtral 8x7B has 46.7B total parameters but only activates ~12.9B per token, making it fast at inference despite the large footprint.

Complete VRAM Table (All Models)

| Model | Parameters | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|---|
| Mistral 7B v0.3 | 7.3B | ~29 GB | ~14.5 GB | ~7.5 GB | ~4.5 GB |
| Mistral 7B Instruct | 7.3B | ~29 GB | ~14.5 GB | ~7.5 GB | ~4.5 GB |
| Mistral Nemo 12B | 12.2B | ~49 GB | ~24.5 GB | ~12.5 GB | ~7.5 GB |
| Mixtral 8x7B | 46.7B (MoE) | ~187 GB | ~93 GB | ~47 GB | ~26 GB |
| Mixtral 8x7B Instruct | 46.7B (MoE) | ~187 GB | ~93 GB | ~47 GB | ~26 GB |
| Mixtral 8x22B | 141B (MoE) | ~564 GB | ~282 GB | ~141 GB | ~75 GB |
| Mistral Small (22B) | 22B | ~88 GB | ~44 GB | ~22 GB | ~13 GB |
| Mistral Large (123B) | 123B | ~492 GB | ~246 GB | ~123 GB | ~66 GB |

Note: Mixtral MoE models require VRAM for all experts even though only 2 of 8 are active per token. This means Mixtral 8x7B needs roughly the same VRAM as a dense 47B model despite running at ~13B model speed. For similar models, see our LLaMA 3 VRAM requirements page.
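
The table values follow simple arithmetic: VRAM ≈ parameter count × bytes per parameter (4 for FP32, 2 for FP16, 1 for INT8, 0.5 for INT4), plus a little headroom for quantization metadata and runtime buffers. Here is a minimal sketch of that back-of-envelope calculation; the function name and the "add 10-20% headroom" guidance are our own framing, not a library API:

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billions: float, precision: str) -> float:
    # Raw weight footprint only; add ~10-20% headroom for runtime overhead before sizing a GPU.
    return params_billions * BYTES_PER_PARAM[precision]

print(f"Mixtral 8x7B FP16: ~{weight_vram_gb(46.7, 'fp16'):.0f} GB")  # ~93 GB - all experts stay resident
print(f"Mistral 7B   FP16: ~{weight_vram_gb(7.3, 'fp16'):.1f} GB")   # ~14.6 GB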

Which GPU Do You Need?

| GPU | VRAM | Best Mistral Model | Precision | Use Case |
|---|---|---|---|---|
| RTX 3050 | 8 GB | Mistral 7B | 4-bit | Dev / testing |
| RTX 4060 | 8 GB | Mistral 7B | 4-bit / Q6_K | Dev / personal |
| RTX 4060 Ti | 16 GB | Mistral 7B / Nemo 12B | FP16 / INT8 | Small production |
| RTX 3090 | 24 GB | Nemo 12B / Mixtral 8x7B | FP16 / 4-bit | Production |
| 2x RTX 3090 | 48 GB | Mixtral 8x7B / Small | INT8 / FP16 | High quality |
| 4x RTX 3090 | 96 GB | Mistral Large | 4-bit | Full capability |
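
The pairings above boil down to a simple fit check: quantized weights plus KV cache must fit inside the card's usable VRAM. A rough sketch of that check; the 90% usable-VRAM fraction is an assumption, since driver and display overhead vary:

def fits(gpu_vram_gb: float, weights_gb: float, kv_cache_gb: float,
         usable_fraction: float = 0.90) -> bool:
    # True if quantized weights + KV cache fit within the usable portion of VRAM.
    return weights_gb + kv_cache_gb <= gpu_vram_gb * usable_fraction

print(fits(8, 4.5, 0.5))    # Mistral 7B 4-bit + 4K KV cache on an RTX 4060: True
print(fits(24, 26, 1.5))    # Mixtral 8x7B 4-bit on a single RTX 3090: False - needs 2x 3090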

For a specific GPU-model pairing analysis, read our RTX 4060 + Mistral 7B article.

Context Length Impact on VRAM

Mistral models support different maximum context lengths depending on the variant, and KV cache VRAM scales linearly with context length:

| Context (tokens) | 7B KV Cache | Nemo 12B KV | Mixtral 8x7B KV | Small 22B KV |
|---|---|---|---|---|
| 4,096 | ~0.5 GB | ~0.8 GB | ~1.5 GB | ~1.5 GB |
| 8,192 | ~1 GB | ~1.6 GB | ~3 GB | ~3 GB |
| 16,384 | ~2 GB | ~3.2 GB | ~6 GB | ~6 GB |
| 32,768 | ~4 GB | ~6.4 GB | ~12 GB | ~12 GB |

The original Mistral 7B (v0.1) used a 4K sliding attention window, which lets inference engines cap the rolling KV cache at the window size and keeps VRAM usage predictable. The v0.2 and v0.3 releases dropped sliding window attention in favour of full attention over a 32K context, so the KV cache grows with the actual input length as shown above. Newer models like Nemo and Mixtral likewise use full attention with longer contexts.
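
Per-token KV cache cost is fixed by the architecture: 2 (keys and values) × layers × KV heads × head dimension × bytes per element. Here is a sketch using Mistral 7B's config (32 layers, 8 KV heads via grouped-query attention, head dim 128) with an FP16 cache; swap in other model configs to reproduce the rest of the table:

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    # The factor of 2 accounts for storing both K and V per layer and per KV head.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_tokens * batch / 1024**3

print(f"{kv_cache_gb(32, 8, 128, 4096):.2f} GB")    # 0.50 GB - matches the 4K row above
print(f"{kv_cache_gb(32, 8, 128, 32768):.2f} GB")   # 4.00 GB at a 32K context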

Batch Size Impact on VRAM

| Model (4-bit, 4K ctx) | Batch 1 | Batch 4 | Batch 8 | Batch 16 |
|---|---|---|---|---|
| Mistral 7B | ~5 GB | ~7 GB | ~9 GB | ~13 GB |
| Nemo 12B | ~8.5 GB | ~12 GB | ~15 GB | ~22 GB |
| Mixtral 8x7B | ~28 GB | ~34 GB | ~40 GB | ~52 GB |

Mistral 7B is exceptionally batch-friendly due to its small KV cache. On a 24 GB GPU, you can serve 16+ concurrent users at 4-bit quantization. This makes it one of the most cost-effective models for production APIs.
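
You can estimate that concurrency ceiling directly: take total VRAM, subtract the quantized weights and a runtime reserve, and divide what's left by the per-session KV cache. A rough sketch, where the 2 GB reserve is an assumed figure for activations and allocator fragmentation:

def max_concurrent_sessions(gpu_vram_gb: float, weights_gb: float,
                            kv_per_session_gb: float, reserve_gb: float = 2.0) -> int:
    # Reserve covers activations, CUDA context and fragmentation (assumed 2 GB).
    return int((gpu_vram_gb - weights_gb - reserve_gb) / kv_per_session_gb)

# 24 GB RTX 3090, 4-bit Mistral 7B weights (~4.5 GB), ~0.5 GB FP16 KV per 4K-token session:
print(max_concurrent_sessions(24, 4.5, 0.5))   # ~35 in theory; 16+ is a comfortable target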

Practical Deployment Recommendations

  • Budget chatbot: Mistral 7B on RTX 4060 (4-bit). 24-28 tok/s, best suited to a single user.
  • Quality chatbot: Mistral 7B on RTX 4060 Ti (FP16). 35 tok/s, 2-3 concurrent users.
  • Production API: Mistral 7B on RTX 3090 (FP16 or INT8). 40-55 tok/s, 8+ concurrent users.
  • Higher capability: Mixtral 8x7B on 2x RTX 3090 (4-bit). The MoE design runs at roughly 13B-model speed while delivering quality well beyond any dense 7B.
  • Maximum quality: Mistral Large on multi-GPU cluster. Enterprise-grade reasoning.

For pricing analysis, see our cost per 1M tokens: GPU vs API comparison and the LLM cost calculator.

Quick Setup Commands

Ollama

curl -fsSL https://ollama.com/install.sh | sh
ollama run mistral:7b         # 7B
ollama run mixtral:8x7b       # Mixtral 8x7B (needs 26+ GB at 4-bit)

vLLM

# Mistral 7B FP16 on RTX 3090
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --dtype float16 --max-model-len 4096

# Mistral 7B AWQ on RTX 4060
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq --max-model-len 4096
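
Once vllm serve is running, it exposes an OpenAI-compatible API on port 8000 by default. A minimal Python client sketch against the first command above; the localhost URL and the prompt are illustrative:

# Query vLLM's OpenAI-compatible chat endpoint (default port 8000).
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": "Summarise Mistral 7B VRAM needs."}],
        "max_tokens": 128,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])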

For full deployment guides, see our Ollama hosting and vLLM hosting pages. Compare with similar models on our best GPU for LLM inference page and use the benchmark tool for speed comparisons.

Deploy This Model Now

Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.

Browse GPU Servers


