VRAM Check: Mixtral MoE on 24 GB
Mixtral 8x7B from Mistral AI uses a mixture-of-experts (MoE) architecture with 46.7B total parameters. Although only two of the eight experts activate per token, all weights must reside in VRAM. On a dedicated RTX 3090 server with 24 GB, deployment is feasible but tight:
| Precision | Model VRAM | KV Cache (2K ctx) | Total | Fits RTX 3090? |
|---|---|---|---|---|
| FP16 | ~93 GB | ~3 GB | ~96 GB | No |
| INT8 (GPTQ) | ~47 GB | ~3 GB | ~50 GB | No |
| INT4 (GPTQ) | ~24 GB | ~3 GB | ~27 GB | Tight (over budget at 2K+) |
| GGUF Q3_K_M | ~20 GB | ~2.5 GB | ~22.5 GB | Yes (1.5 GB spare) |
| GGUF Q4_K_S | ~23 GB | ~2.5 GB | ~25.5 GB | Tight |
Mixtral at Q3_K_M is the only reliable fit on 24 GB, leaving about 1.5 GB for KV cache at short context lengths. For full VRAM analysis, see our Mixtral 8x7B VRAM requirements guide.
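The model-weight column in the table follows directly from parameter count times bits per weight. A minimal sketch, using the 46.7B figure from above (GGUF K-quants have fractional effective bits per weight, so those rows will differ slightly):

```python
def model_vram_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: parameters x bits, converted to bytes."""
    return n_params * bits_per_weight / 8 / 1e9

N = 46.7e9  # Mixtral 8x7B total parameters

print(f"FP16: {model_vram_gb(N, 16):.1f} GB")  # ~93 GB
print(f"INT8: {model_vram_gb(N, 8):.1f} GB")   # ~47 GB
print(f"INT4: {model_vram_gb(N, 4):.1f} GB")   # ~23 GB; GPTQ metadata pushes it to ~24 GB
```

Add the KV cache on top of these figures to get the totals in the table.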
Setup with vLLM
```bash
# Install vLLM
pip install vllm

# Launch Mixtral at 4-bit with tight memory settings
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
  --quantization gptq \
  --dtype float16 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.95 \
  --port 8000

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    "messages": [{"role": "user", "content": "Compare MoE and dense model architectures."}],
    "max_tokens": 256
  }'
```
Note the restricted `--max-model-len 2048` and high `--gpu-memory-utilization`. For a comparison of serving frameworks, read our vLLM vs Ollama guide.
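Since the server speaks the OpenAI-compatible chat protocol, scripting against it only requires building the same JSON body the curl test sends. A minimal stdlib sketch (the helper name is ours; the endpoint and model name come from the launch command above):

```python
import json

def chat_request(prompt: str, max_tokens: int = 256) -> str:
    """Build the JSON body for vLLM's OpenAI-compatible /v1/chat/completions."""
    body = {
        "model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

print(chat_request("Compare MoE and dense model architectures."))
```

POST this body to `http://localhost:8000/v1/chat/completions` with `Content-Type: application/json`, or point any OpenAI client library at `http://localhost:8000/v1`.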
Setup with Ollama
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Mixtral at Q3_K_M quantisation
ollama pull mixtral:8x7b-instruct-v0.1-q3_K_M

# Serve as API
ollama serve &
curl http://localhost:11434/api/generate \
  -d '{"model": "mixtral:8x7b-instruct-v0.1-q3_K_M", "prompt": "Hello from Mixtral!"}'
```
Ollama, which builds on llama.cpp, is often more memory-efficient than vLLM for single-user inference of large GGUF models.
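By default, `/api/generate` streams its reply as one JSON object per line, each carrying a `response` fragment. A small sketch for reassembling that stream into the full completion (the sample lines are illustrative, not captured output):

```python
import json

def join_stream(ndjson_lines):
    """Concatenate the 'response' fragments from Ollama's line-delimited JSON stream."""
    return "".join(json.loads(line).get("response", "") for line in ndjson_lines)

# Illustrative chunks in the shape Ollama emits; the final object has "done": true
sample = [
    '{"model":"mixtral","response":"Hello","done":false}',
    '{"model":"mixtral","response":" from Mixtral!","done":true}',
]
print(join_stream(sample))  # Hello from Mixtral!
```

Alternatively, add `"stream": false` to the request body to receive a single JSON object instead.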
RTX 3090 Benchmark Results
Benchmarked with llama.cpp at Q3_K_M, 256-token input, 128-token output. See the benchmark tool for more data.
| Configuration | Prompt tok/s | Gen tok/s | TTFT | Context Limit |
|---|---|---|---|---|
| Q3_K_M, batch 1 | 1,850 | 42 | 138 ms | ~2K |
| Q4_K_S, batch 1 | 1,640 | 38 | 156 ms | ~1.5K |
| GPTQ INT4 (vLLM), batch 1 | 1,720 | 36 | 149 ms | ~2K |
At 42 tok/s, Mixtral on the RTX 3090 is usable for single-user interactive chat but significantly slower than a LLaMA 3 8B deployment on the same card. The MoE architecture trades speed for quality at this VRAM tier.
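Those throughput figures translate directly into end-to-end response time: prefill the prompt, then generate token by token. A rough estimate for the benchmark workload, ignoring scheduling overhead:

```python
def response_time_s(prompt_toks, out_toks, prompt_tps, gen_tps):
    """Approximate latency: prompt prefill time plus sequential generation time."""
    return prompt_toks / prompt_tps + out_toks / gen_tps

# Q3_K_M, batch 1 figures from the table above
t = response_time_s(256, 128, 1850, 42)
print(f"{t:.1f} s")  # ~3.2 s for a 128-token reply
```

The prefill term (256 / 1850 ≈ 0.138 s) matches the measured 138 ms TTFT for the Q3_K_M row.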
Optimisation Tips
- Use Q3_K_M quantisation for the most reliable fit on 24 GB with room for a minimal KV cache.
- Set context to 2K tokens or less to avoid OOM errors. Longer contexts push total VRAM past 24 GB.
- Set `--gpu-memory-utilization 0.95` in vLLM to use every available byte.
- Disable concurrent batching since there is no VRAM headroom for multiple requests.
- Consider LLaMA 3 8B instead if you do not specifically need Mixtral’s MoE quality. It runs 3x faster on the same GPU.
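The context ceiling behind the first two tips follows from spare VRAM divided by KV-cache cost per token. A back-of-the-envelope sketch, using the table's ~2.5 GB at 2K context as the per-token rate (an approximation — actual KV cost depends on cache precision and attention layout):

```python
def max_context_tokens(spare_vram_gb: float, kv_gb_per_2k: float = 2.5) -> int:
    """Estimate how many tokens of KV cache fit in the spare VRAM."""
    per_token_gb = kv_gb_per_2k / 2048
    return int(spare_vram_gb / per_token_gb)

# Q3_K_M weights (~20 GB) leave roughly 4 GB of the 24 GB card for KV cache
print(max_context_tokens(4.0))  # ~3.2K tokens, before framework overhead
```

Runtime buffers and framework overhead eat into that figure, which is why the benchmarks above top out near 2K in practice.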
Compare Mixtral versus LLaMA 3 in our Mixtral vs LLaMA 3 70B comparison. Use the cost calculator to estimate operating costs.
Next Steps
For comfortable Mixtral deployment with longer context, upgrade to an RTX 5090 with 32 GB or consider a dual-GPU setup. For lighter models that perform well on the RTX 3090, browse the model guides section. Check GPU pricing with the cheapest GPU for AI inference guide.
Deploy Mixtral 8x7B Now
Run Mixtral MoE inference on a dedicated RTX 3090 server. Full root access and UK data centre hosting.
Browse GPU Servers