
Run Mixtral 8x7B on RTX 3090 (MoE Deployment)

Guide to deploying Mixtral 8x7B on an RTX 3090 with 24 GB VRAM. Covers VRAM constraints for MoE models, quantisation setup, benchmarks, and optimisation tips.

VRAM Check: Mixtral MoE on 24 GB

Mixtral 8x7B from Mistral AI uses a mixture-of-experts architecture with 46.7B total parameters. Although only two of the eight experts activate per token, all weights must reside in VRAM. The RTX 3090 with 24 GB makes it feasible but tight on a dedicated GPU server:

Precision    | Model VRAM | KV Cache (2K ctx) | Total    | Fits RTX 3090?
FP16         | ~93 GB     | ~3 GB             | ~96 GB   | No
INT8 (GPTQ)  | ~47 GB     | ~3 GB             | ~50 GB   | No
INT4 (GPTQ)  | ~24 GB     | ~3 GB             | ~27 GB   | Tight (over budget at 2K+ context)
GGUF Q3_K_M  | ~20 GB     | ~2.5 GB           | ~22.5 GB | Yes (~1.5 GB spare)
GGUF Q4_K_S  | ~23 GB     | ~2.5 GB           | ~25.5 GB | Tight

Mixtral at Q3_K_M is the only reliable fit on 24 GB, leaving roughly 1.5 GB of headroom beyond the KV cache at short context lengths. For a full VRAM analysis, see our Mixtral 8x7B VRAM requirements guide.
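
The arithmetic behind the table is simple to check yourself: total parameters times bytes per parameter, plus the KV cache. Here is a minimal Python sketch; the bytes-per-parameter figures are approximations (K-quants mix bit widths within a file), so treat the output as a sanity check rather than an exact budget.

# Back-of-envelope VRAM check for Mixtral 8x7B on a 24 GB card.
# Bytes-per-parameter values are approximations, not exact file sizes.
TOTAL_PARAMS = 46.7e9   # all 8 experts stay resident, not just the 2 active
KV_CACHE_GB = 2.5       # rough KV cache at ~2K context (see table above)
VRAM_GB = 24

BYTES_PER_PARAM = {
    "FP16": 2.0,
    "INT8 (GPTQ)": 1.0,
    "INT4 (GPTQ)": 0.5,
    "GGUF Q4_K_S": 0.49,   # ~3.9 effective bits/weight
    "GGUF Q3_K_M": 0.43,   # ~3.4 effective bits/weight
}

for name, bpp in BYTES_PER_PARAM.items():
    weights_gb = TOTAL_PARAMS * bpp / 1e9
    total_gb = weights_gb + KV_CACHE_GB
    verdict = "fits" if total_gb <= VRAM_GB else "over budget"
    print(f"{name:>12}: ~{weights_gb:.0f} GB weights, ~{total_gb:.1f} GB total -> {verdict}")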

Setup with vLLM

# Install vLLM
pip install vllm

# Launch Mixtral at 4-bit with tight memory settings
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
  --quantization gptq \
  --dtype float16 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.95 \
  --port 8000

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    "messages": [{"role": "user", "content": "Compare MoE and dense model architectures."}],
    "max_tokens": 256
  }'

Note the restricted --max-model-len 2048 and the high --gpu-memory-utilization: both are needed to keep the GPTQ build inside 24 GB. For a comparison of serving frameworks, read our vLLM vs Ollama guide.
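
Because vLLM's server is OpenAI-compatible, you can also script against the same endpoint with the official openai Python client. A minimal sketch (the api_key value is a placeholder; vLLM ignores it unless launched with --api-key):

# Call the vLLM OpenAI-compatible endpoint from Python.
# Requires: pip install openai
from openai import OpenAI

# api_key is a placeholder: vLLM ignores it unless started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    messages=[{"role": "user", "content": "Compare MoE and dense model architectures."}],
    max_tokens=256,
)
print(response.choices[0].message.content)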

Setup with Ollama

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Mixtral at Q3_K_M (downloads the model on first use)
ollama run mixtral:8x7b-instruct-v0.1-q3_K_M

# Serve as an API (skip ollama serve if the installer already started the service)
ollama serve &
curl http://localhost:11434/api/generate \
  -d '{"model": "mixtral:8x7b-instruct-v0.1-q3_K_M", "prompt": "Hello from Mixtral!"}'

Ollama, built on llama.cpp, is often more memory-efficient than vLLM for single-user inference of large GGUF models.
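
By default, /api/generate streams its reply as newline-delimited JSON chunks. A short sketch for consuming that stream from Python (assumes the requests package; the model tag matches the pull above):

# Stream tokens from Ollama's /api/generate endpoint.
# Requires: pip install requests
import json
import requests

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mixtral:8x7b-instruct-v0.1-q3_K_M", "prompt": "Hello from Mixtral!"},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)  # partial text
        if chunk.get("done"):
            break
print()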

RTX 3090 Benchmark Results

Benchmarked with a 256-token input and 128-token output; the GGUF rows use llama.cpp, the GPTQ row uses vLLM. See the benchmark tool for more data.

Configuration              | Prompt tok/s | Gen tok/s | TTFT   | Context Limit
Q3_K_M, batch 1            | 1,850        | 42        | 138 ms | ~2K
Q4_K_S, batch 1            | 1,640        | 38        | 156 ms | ~1.5K
GPTQ INT4 (vLLM), batch 1  | 1,720        | 36        | 149 ms | ~2K

At 42 tok/s, Mixtral on the RTX 3090 is usable for single-user interactive chat but significantly slower than a LLaMA 3 8B deployment on the same card. The MoE architecture trades speed for quality at this VRAM tier.
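
To get a rough generation-speed number on your own card, you can time a single request against the vLLM endpoint. A crude sketch: wall-clock time includes prompt processing, so this will read somewhat lower than the table's steady-state figures.

# Crude end-to-end throughput check against the vLLM endpoint above.
# Requires: pip install openai
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    messages=[{"role": "user", "content": "Explain mixture-of-experts routing."}],
    max_tokens=128,
)
elapsed = time.perf_counter() - start

# usage.completion_tokens is reported by vLLM in non-streaming responses
print(f"{resp.usage.completion_tokens / elapsed:.1f} tok/s end to end")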

Optimisation Tips

  • Use Q3_K_M quantisation for the most reliable fit on 24 GB with room for a minimal KV cache.
  • Set context to 2K tokens or less to avoid OOM errors; longer contexts push total VRAM past 24 GB (see the headroom check after this list).
  • Set --gpu-memory-utilization 0.95 in vLLM to use every available byte.
  • Disable concurrent batching (for vLLM, e.g. --max-num-seqs 1), since there is no VRAM headroom for multiple in-flight requests.
  • Consider LLaMA 3 8B instead if you do not specifically need Mixtral’s MoE quality. It runs 3x faster on the same GPU.
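
Before raising the context length, confirm how much VRAM is actually free while the model is loaded. A minimal sketch that shells out to nvidia-smi (assumes the NVIDIA driver utilities are installed):

# Report free VRAM while the model is loaded; assumes nvidia-smi is on PATH.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout
used, total = (int(v) for v in out.strip().splitlines()[0].split(","))
print(f"{used} MiB used / {total} MiB total ({total - used} MiB free)")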

Compare Mixtral versus LLaMA 3 in our Mixtral vs LLaMA 3 70B comparison. Use the cost calculator to estimate operating costs.

Next Steps

For comfortable Mixtral deployment with longer context, upgrade to an RTX 5090 with 32 GB or consider a dual-GPU setup. For lighter models that perform well on the RTX 3090, browse the model guides section. Check GPU pricing with the cheapest GPU for AI inference guide.

Deploy Mixtral 8x7B Now

Run Mixtral MoE inference on a dedicated RTX 3090 server. Full root access and UK data centre hosting.

Browse GPU Servers
