Why Choose Mistral for Self-Hosted AI
Mistral AI has built a reputation for delivering high-performance language models that punch well above their parameter count. From the compact Mistral 7B to the mixture-of-experts Mixtral 8x7B, these models offer an excellent balance of speed and capability. Deploying Mistral on a dedicated GPU server lets you control throughput, latency, and data residency without relying on third-party APIs.
GigaGPU’s Mistral hosting platform provides bare-metal GPU infrastructure optimised for Mistral inference. Whether you need a lightweight 7B model for fast responses or the Mixtral 8x22B for complex reasoning, self-hosting on dedicated hardware removes per-token costs and keeps sensitive prompts private. This guide covers every step from environment setup to production deployment.
GPU VRAM Requirements for Mistral Models
Mistral models vary significantly in VRAM requirements. The sliding-window attention in Mistral 7B makes it particularly memory-efficient. For a broader GPU comparison, see our best GPU for LLM inference guide.
| Model | Precision | VRAM Required | Recommended GPU |
|---|---|---|---|
| Mistral 7B | FP16 | ~14 GB | 1x RTX 5090 |
| Mistral 7B | AWQ 4-bit | ~5 GB | 1x RTX 3090 |
| Mixtral 8x7B | FP16 | ~90 GB | 2x RTX 6000 Pro 96 GB |
| Mixtral 8x7B | AWQ 4-bit | ~26 GB | 1x RTX 6000 Pro |
| Mixtral 8x22B | FP16 | ~280 GB | 4x RTX 6000 Pro 96 GB |
| Mistral Large | FP16 | ~240 GB | 4x RTX 6000 Pro 96 GB |
For the largest models, GigaGPU’s multi-GPU cluster hosting provides NVLink-connected nodes for tensor-parallel inference.
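The FP16 rows in the table are roughly "parameter count times two bytes"; this back-of-envelope sketch reproduces them (parameter counts are approximate, and KV cache plus activations come on top, so always leave headroom):

```python
def estimate_weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """VRAM needed just to hold the weights: parameter count x storage width.
    KV cache and activations are extra, so real deployments need headroom."""
    return params_billion * bytes_per_param

# Mistral 7B in FP16 (2 bytes/param) -> matches the ~14 GB row above
print(estimate_weight_vram_gb(7.0, 2))   # 14.0
# Mixtral 8x7B has roughly 47B total parameters -> close to the ~90 GB row
print(estimate_weight_vram_gb(47.0, 2))  # 94.0
```

The same function explains why 4-bit quantisation (0.5 bytes per parameter) shrinks Mixtral 8x7B onto a single card.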
Preparing Your GPU Server
Start by updating system packages, then confirm the NVIDIA driver and CUDA toolkit are installed and your GPU is detected:

```bash
sudo apt update && sudo apt upgrade -y
nvidia-smi
```
Set up a Python virtual environment and install the core dependencies:
```bash
python3 -m venv ~/mistral-env
source ~/mistral-env/bin/activate
pip install --upgrade pip
```
Install PyTorch with CUDA support. Our PyTorch GPU installation guide covers edge cases:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
Log in to Hugging Face to access gated model weights:
```bash
pip install huggingface_hub
huggingface-cli login
```
Deploying Mistral with vLLM
vLLM is the recommended engine for production Mistral deployments thanks to its PagedAttention and continuous batching support. Read our vLLM vs Ollama comparison to understand the trade-offs between engines.
Install vLLM:
```bash
pip install vllm
```
Launch Mistral 7B Instruct as an OpenAI-compatible server:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --dtype float16 \
  --max-model-len 32768 \
  --port 8000 \
  --tensor-parallel-size 1
```
For Mixtral 8x7B across two GPUs:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --dtype float16 \
  --max-model-len 32768 \
  --port 8000 \
  --tensor-parallel-size 2
```
For production configuration details, see our vLLM production setup guide. GigaGPU also offers managed vLLM hosting with Mistral models pre-loaded.
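For long-running deployments you will usually want the server supervised so it restarts after crashes or reboots. The sketch below is a minimal systemd unit; the service name, user, and virtualenv path are assumptions to adapt to your own server:

```ini
# /etc/systemd/system/vllm-mistral.service (illustrative; adjust user and paths)
[Unit]
Description=vLLM OpenAI-compatible server for Mistral 7B Instruct
After=network-online.target

[Service]
User=ubuntu
ExecStart=/home/ubuntu/mistral-env/bin/python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 --dtype float16 --port 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now vllm-mistral` and tail logs via `journalctl -u vllm-mistral -f`.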
Deploying Mistral with Ollama
Ollama provides the fastest path from zero to a running Mistral endpoint:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Pull and run Mistral 7B:
```bash
ollama pull mistral
ollama run mistral
```
For Mixtral 8x7B:
```bash
ollama pull mixtral
ollama run mixtral
```
By default Ollama binds only to localhost. To accept remote connections, serve on all network interfaces:

```bash
OLLAMA_HOST=0.0.0.0:11434 ollama serve
```
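Beyond the interactive CLI, Ollama exposes a native HTTP API on port 11434. This stdlib-only Python sketch posts a prompt to the `/api/generate` endpoint (the model name and prompt are just examples, and the call is guarded so the script degrades gracefully when no server is running):

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str) -> bytes:
    """Serialise a request body for Ollama's /api/generate endpoint.
    stream=False asks for a single JSON response instead of a token stream."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

body = build_generate_request("mistral", "Explain sliding-window attention in one sentence.")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=60) as resp:
        print(json.loads(resp.read())["response"])
except OSError as exc:  # daemon not running or unreachable
    print(f"Ollama not reachable: {exc}")
```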
GigaGPU’s dedicated Ollama hosting comes with GPU drivers and Ollama pre-installed.
Testing the API Endpoint
Send a test completion request to the vLLM server:
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "What is mixture-of-experts architecture?"}],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```
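The same endpoint can be exercised from Python. This stdlib-only sketch mirrors the curl request above, with the network call guarded so it fails gracefully when the server is down:

```python
import json
import urllib.request

def build_chat_request(model: str, user_message: str) -> bytes:
    """Serialise an OpenAI-compatible chat completion request body."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 256,
        "temperature": 0.7,
    }
    return json.dumps(payload).encode()

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=build_chat_request("mistralai/Mistral-7B-Instruct-v0.3",
                            "What is mixture-of-experts architecture?"),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=60) as resp:
        reply = json.loads(resp.read())
        print(reply["choices"][0]["message"]["content"])
except OSError as exc:  # server not running or unreachable
    print(f"Request failed: {exc}")
```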
Compare your throughput with our tokens-per-second benchmark to verify optimal performance.
Production Tuning and Next Steps
To get the most from your Mistral deployment:
- Quantise for efficiency: AWQ 4-bit Mixtral 8x7B fits on a single RTX 6000 Pro, dramatically reducing cost.
- Leverage sliding-window attention: Mistral 7B uses a 4096-token sliding window, so keep `--max-model-len` reasonable to save VRAM.
- Scale with tensor parallelism: distribute Mixtral 8x22B across four GPUs for consistently low-latency responses.
- Monitor costs: use our cost-per-million-tokens calculator to compare self-hosting against API pricing.
- Pick the right hardware: our RTX 3090 vs RTX 5090 benchmark helps you choose the best-value GPU.
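To make the cost comparison in the last two points concrete, here is a toy calculator; the server price and throughput below are placeholder inputs, not quotes:

```python
def self_host_cost_per_million_tokens(server_cost_per_hour: float,
                                      tokens_per_second: float) -> float:
    """Cost per million generated tokens when renting a GPU server at a flat
    hourly rate and keeping it saturated at the given throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return server_cost_per_hour / tokens_per_hour * 1_000_000

# Hypothetical example: a $2.50/hr server sustaining 1,500 tok/s of batched output
print(round(self_host_cost_per_million_tokens(2.50, 1500), 3))  # 0.463
```

Note this is a best-case figure: idle hours still bill, so effective cost scales inversely with utilisation.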
If you are also evaluating other model families, read our guides on how to deploy LLaMA 3 and deploy Qwen on dedicated GPU servers. Browse all of our model deployment guides for more options.
Run Mistral on Bare-Metal GPU Infrastructure
Deploy Mistral 7B, Mixtral 8x7B, or Mistral Large on dedicated NVIDIA GPUs with full root access, pre-installed CUDA drivers, and zero per-token fees.
Browse GPU Servers