VRAM Check: Mixtral MoE on 24 GB
Mixtral 8x7B from Mistral AI uses a mixture-of-experts (MoE) architecture with 46.7B total parameters. Although only two of the eight experts activate per token, all weights must reside in VRAM. On a dedicated RTX 3090 server with 24 GB, deployment is feasible but tight:
| Precision | Model VRAM | KV Cache (2K ctx) | Total | Fits RTX 3090? |
|---|---|---|---|---|
| FP16 | ~93 GB | ~3 GB | ~96 GB | No |
| INT8 (GPTQ) | ~47 GB | ~3 GB | ~50 GB | No |
| INT4 (GPTQ) | ~24 GB | ~3 GB | ~27 GB | Tight (over budget at 2K+) |
| GGUF Q3_K_M | ~20 GB | ~2.5 GB | ~22.5 GB | Yes (1.5 GB spare) |
| GGUF Q4_K_S | ~23 GB | ~2.5 GB | ~25.5 GB | Tight |
Mixtral at Q3_K_M is the only reliable fit on 24 GB, leaving about 1.5 GB for KV cache at short context lengths. For full VRAM analysis, see our Mixtral 8x7B VRAM requirements guide.
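The model-weight column in the table follows directly from parameter count times bits per weight. A minimal sketch, using the 46.7B figure from above (GGUF K-quants have fractional effective bits per weight, so those rows will differ slightly):

```python
def model_vram_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: parameters x bits, converted to bytes."""
    return n_params * bits_per_weight / 8 / 1e9

N = 46.7e9  # Mixtral 8x7B total parameters

print(f"FP16: {model_vram_gb(N, 16):.1f} GB")  # ~93 GB
print(f"INT8: {model_vram_gb(N, 8):.1f} GB")   # ~47 GB
print(f"INT4: {model_vram_gb(N, 4):.1f} GB")   # ~23 GB; GPTQ metadata pushes it to ~24 GB
```

Add the KV cache on top of these figures to get the totals in the table.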
Setup with vLLM
```bash
# Install vLLM
pip install vllm

# Launch Mixtral at 4-bit with tight memory settings
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
  --quantization gptq \
  --dtype float16 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.95 \
  --port 8000

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    "messages": [{"role": "user", "content": "Compare MoE and dense model architectures."}],
    "max_tokens": 256
  }'
```
Note the restricted `--max-model-len 2048` and high `--gpu-memory-utilization`. For a comparison of serving frameworks, read our vLLM vs Ollama guide.
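Since the server speaks the OpenAI-compatible chat protocol, scripting against it only requires building the same JSON body the curl test sends. A minimal stdlib sketch (the helper name is ours; the endpoint and model name come from the launch command above):

```python
import json

def chat_request(prompt: str, max_tokens: int = 256) -> str:
    """Build the JSON body for vLLM's OpenAI-compatible /v1/chat/completions."""
    body = {
        "model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

print(chat_request("Compare MoE and dense model architectures."))
```

POST this body to `http://localhost:8000/v1/chat/completions` with `Content-Type: application/json`, or point any OpenAI client library at `http://localhost:8000/v1`.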
Setup with Ollama
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Mixtral at Q3_K_M quantisation
ollama pull mixtral:8x7b-instruct-v0.1-q3_K_M

# Serve as API
ollama serve &
curl http://localhost:11434/api/generate \
  -d '{"model": "mixtral:8x7b-instruct-v0.1-q3_K_M", "prompt": "Hello from Mixtral!"}'
```
Ollama, which builds on llama.cpp, is often more memory-efficient than vLLM for single-user inference of large GGUF models.
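By default, `/api/generate` streams its reply as one JSON object per line, each carrying a `response` fragment. A small sketch for reassembling that stream into the full completion (the sample lines are illustrative, not captured output):

```python
import json

def join_stream(ndjson_lines):
    """Concatenate the 'response' fragments from Ollama's line-delimited JSON stream."""
    return "".join(json.loads(line).get("response", "") for line in ndjson_lines)

# Illustrative chunks in the shape Ollama emits; the final object has "done": true
sample = [
    '{"model":"mixtral","response":"Hello","done":false}',
    '{"model":"mixtral","response":" from Mixtral!","done":true}',
]
print(join_stream(sample))  # Hello from Mixtral!
```

Alternatively, add `"stream": false` to the request body to receive a single JSON object instead.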
RTX 3090 Benchmark Results
Benchmarked with llama.cpp at Q3_K_M, 256-token input, 128-token output. See the benchmark tool for more data.
| Configuration | Prompt tok/s | Gen tok/s | TTFT | Context Limit |
|---|---|---|---|---|
| Q3_K_M, batch 1 | 1,850 | 42 | 138 ms | ~2K |
| Q4_K_S, batch 1 | 1,640 | 38 | 156 ms | ~1.5K |
| GPTQ INT4 (vLLM), batch 1 | 1,720 | 36 | 149 ms | ~2K |
At 42 tok/s, Mixtral on the RTX 3090 is usable for single-user interactive chat but significantly slower than a LLaMA 3 8B deployment on the same card. The MoE architecture trades speed for quality at this VRAM tier.
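Those throughput figures translate directly into end-to-end response time: prefill the prompt, then generate token by token. A rough estimate for the benchmark workload, ignoring scheduling overhead:

```python
def response_time_s(prompt_toks, out_toks, prompt_tps, gen_tps):
    """Approximate latency: prompt prefill time plus sequential generation time."""
    return prompt_toks / prompt_tps + out_toks / gen_tps

# Q3_K_M, batch 1 figures from the table above
t = response_time_s(256, 128, 1850, 42)
print(f"{t:.1f} s")  # ~3.2 s for a 128-token reply
```

The prefill term (256 / 1850 ≈ 0.138 s) matches the measured 138 ms TTFT for the Q3_K_M row.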
Optimisation Tips
- Use Q3_K_M quantisation for the most reliable fit on 24 GB with room for a minimal KV cache.
- Set context to 2K tokens or less to avoid OOM errors. Longer contexts push total VRAM past 24 GB.
- Set `--gpu-memory-utilization 0.95` in vLLM to use every available byte.
- Disable concurrent batching since there is no VRAM headroom for multiple requests.
- Consider LLaMA 3 8B instead if you do not specifically need Mixtral’s MoE quality. It runs 3x faster on the same GPU.
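The context ceiling behind the first two tips follows from spare VRAM divided by KV-cache cost per token. A back-of-the-envelope sketch, using the table's ~2.5 GB at 2K context as the per-token rate (an approximation — actual KV cost depends on cache precision and attention layout):

```python
def max_context_tokens(spare_vram_gb: float, kv_gb_per_2k: float = 2.5) -> int:
    """Estimate how many tokens of KV cache fit in the spare VRAM."""
    per_token_gb = kv_gb_per_2k / 2048
    return int(spare_vram_gb / per_token_gb)

# Q3_K_M weights (~20 GB) leave roughly 4 GB of the 24 GB card for KV cache
print(max_context_tokens(4.0))  # ~3.2K tokens, before framework overhead
```

Runtime buffers and framework overhead eat into that figure, which is why the benchmarks above top out near 2K in practice.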
Compare Mixtral versus LLaMA 3 in our Mixtral vs LLaMA 3 70B comparison. Use the cost calculator to estimate operating costs.
Next Steps
For comfortable Mixtral deployment with longer context, upgrade to an RTX 5090 with 32 GB or consider a dual-GPU setup. For lighter models that perform well on the RTX 3090, browse the model guides section. Check GPU pricing with the cheapest GPU for AI inference guide.
Deploy Mixtral 8x7B Now
Run Mixtral MoE inference on a dedicated RTX 3090 server. Full root access and UK data centre hosting.
Browse GPU Servers