
Ollama on RTX 5090: Running Large Models in 32GB

The RTX 5090's 32GB GDDR7 unlocks Mixtral 8x7B, 34B models in high quality, and dual-model setups in Ollama. Full compatibility guide with performance benchmarks.

What 32GB Unlocks for Ollama

The RTX 5090 pushes Ollama into a new tier. With 32GB GDDR7 at 1,792 GB/s bandwidth on a dedicated GPU server, you can run models that simply do not fit on 24GB cards. Mixtral 8x7B in Q4, CodeLlama 34B with comfortable context windows, 14B models in FP16, and multi-model configurations that previously required two GPUs all become possible on a single card.

Combined with Blackwell architecture’s native FP4 support, the 5090 also makes smaller models dramatically faster. For the serving-engine comparison, see vLLM vs Ollama.

Large Model Compatibility Table

| Model | Quantisation | VRAM Used | RTX 3090 (24GB) | RTX 5090 (32GB) |
|---|---|---|---|---|
| Llama 3 8B | FP16 | ~16 GB | Yes | Yes |
| Llama 3 13B | FP16 | ~26 GB | No | Yes |
| DeepSeek R1 14B | FP16 | ~28 GB | No | Yes |
| Qwen 2.5 14B | FP16 | ~28 GB | No | Yes |
| CodeLlama 34B | Q4_K_M | ~20 GB | Tight | Yes (12GB free) |
| Mixtral 8x7B | Q4_K_M | ~26 GB | No | Yes |
| Llama 3 70B | Q4_K_M | ~40 GB | No | No |
| Llama 3 70B | Q2_K | ~26 GB | No | Yes (low quality) |

The key unlocks are 13B-14B FP16 (no quantisation loss), Mixtral 8x7B (the most capable open mixture-of-experts model), and 34B models with enough headroom for long context. The 70B tier remains out of reach for comfortable single-GPU use. For 70B, see multi-GPU cluster hosting.
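The fit-or-not column follows from simple arithmetic: weight memory is roughly parameter count times bytes per weight, plus overhead for the KV cache and CUDA context. A minimal Python sketch of that rule of thumb; the per-weight byte counts and the fixed 2 GB overhead are approximations, not exact Ollama allocations:

```python
# Approximate average storage per weight for common formats.
# Q4_K_M averages ~4.65 bits/weight, Q2_K ~2.6 bits/weight (approximations).
BYTES_PER_WEIGHT = {
    "fp16": 2.0,
    "q4_k_m": 4.65 / 8,
    "q2_k": 2.6 / 8,
}

def fits(params_b: float, quant: str, vram_gb: float = 32.0,
         overhead_gb: float = 2.0) -> bool:
    """True if the weights plus a fixed KV-cache/context overhead fit in VRAM."""
    weights_gb = params_b * BYTES_PER_WEIGHT[quant]
    return weights_gb + overhead_gb <= vram_gb

print(fits(14, "fp16"))        # 14B FP16 ~28 GB -> fits in 32 GB
print(fits(14, "fp16", 24.0))  # -> does not fit on a 24 GB card
print(fits(46.7, "q4_k_m"))    # Mixtral 8x7B (~46.7B total params) -> fits
```

The same arithmetic reproduces the 70B row: 70B at Q4_K_M needs ~41 GB of weights alone, well past 32 GB, while Q2_K squeezes it under the limit at a steep quality cost.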

Setup and Configuration

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Mixtral 8x7B — now fits in a single GPU
ollama run mixtral:8x7b

# 13B FP16 for maximum quality
ollama run llama3:13b

# DeepSeek R1 14B full precision
ollama run deepseek-r1:14b-fp16

# Expose API
OLLAMA_HOST=0.0.0.0 ollama serve

On the 5090, Ollama needs no special configuration: token generation automatically benefits from the GDDR7 bandwidth. The only requirement is a driver stack with CUDA 12.8 or newer, which Blackwell cards demand. See the CUDA installation guide for details.
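Once the API is exposed, any HTTP client can drive it. Here is a minimal sketch using only the Python standard library against Ollama's `/api/generate` endpoint in non-streaming mode; the model tag and prompt are just examples:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_generate_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a single completion request and return the generated text."""
    payload = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Requires a running Ollama server with the model pulled
    print(generate("mixtral:8x7b", "Explain the KV cache in one sentence."))
```

With `stream` set to `True` (the API default) the server instead returns one JSON object per token, which is what you want for interactive chat UIs.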

Performance Benchmarks

| Model | Quantisation | Tokens/s (RTX 5090) | Tokens/s (RTX 3090) |
|---|---|---|---|
| Llama 3 8B | Q4_K_M | ~155 | ~82 |
| Llama 3 8B | FP16 | ~105 | ~48 |
| Llama 3 13B | FP16 | ~62 | OOM |
| DeepSeek R1 14B | FP16 | ~55 | OOM |
| Mixtral 8x7B | Q4_K_M | ~38 | OOM |
| CodeLlama 34B | Q4_K_M | ~32 | ~18 |

The 5090’s bandwidth advantage means even models that fit on the 3090 run nearly twice as fast. At ~38 tokens/s, Mixtral is fast enough for real-time chat while offering substantially stronger reasoning than single 8B models. Check more numbers in the tokens-per-second benchmark.
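The near-2x gap tracks memory bandwidth because single-request decoding is bandwidth bound: every generated token must stream the full weight set from VRAM, so tokens/s cannot exceed bandwidth divided by weight size. A back-of-envelope sketch (weight size is the approximate table figure; for Mixtral the bound is looser, since a mixture-of-experts model only reads its active experts per token):

```python
def decode_ceiling(bandwidth_gbs: float, weights_gb: float) -> float:
    """Upper bound on tokens/s when decoding is memory-bandwidth bound:
    each generated token streams the full weight set from VRAM."""
    return bandwidth_gbs / weights_gb

# Bandwidths: RTX 5090 ~1792 GB/s, RTX 3090 ~936 GB/s
for gpu, bw in [("RTX 5090", 1792), ("RTX 3090", 936)]:
    ceiling = decode_ceiling(bw, 5.5)  # Llama 3 8B Q4_K_M, ~5.5 GB weights
    print(f"{gpu}: <= {ceiling:.0f} tok/s theoretical")
# Measured ~155 and ~82 tok/s -- roughly half the ceiling on both cards,
# which is typical once KV-cache reads and kernel overhead are included.
```

That both cards land at a similar fraction of their ceiling is what you expect when bandwidth, not compute, is the bottleneck.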

Dual-Model and Multi-Model Setups

With 32GB, the 5090 can keep two models loaded simultaneously in Ollama:

# Chat model + Code model simultaneously
# Both stay in VRAM — no swapping delay
ollama pull llama3:8b          # ~5.5 GB Q4
ollama pull codellama:13b      # ~8.5 GB Q4
# Total: ~14 GB, leaving 18 GB for KV cache

# Or a chat model + embedding model for RAG
ollama pull llama3:8b          # ~5.5 GB
ollama pull nomic-embed-text   # ~0.5 GB
# Total: ~6 GB, 26 GB free for context

This enables chatbot and AI search pipelines on a single GPU without model-switching latency.
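By default Ollama may evict one model to make room for another; to keep both resident, raise the loaded-model limit via the `OLLAMA_MAX_LOADED_MODELS` environment variable when starting the server. A sketch using the models from the example above (a generate request with no prompt loads a model without producing output):

```shell
# Allow two models to stay resident in VRAM simultaneously
export OLLAMA_MAX_LOADED_MODELS=2
OLLAMA_HOST=0.0.0.0 ollama serve &

# An empty generate request loads each model and keeps it warm
curl http://localhost:11434/api/generate -d '{"model": "llama3:8b"}'
curl http://localhost:11434/api/generate -d '{"model": "codellama:13b"}'
```

After this, requests to either model are served from VRAM with no load latency between them.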

RTX 5090 vs RTX 3090 for Ollama

Choose the RTX 5090 for Ollama when you need 13B-14B models without quantisation, Mixtral 8x7B, multi-model serving, or maximum speed for 7B-8B models. The RTX 3090 remains the value choice when 24GB VRAM covers your model needs and speed is less critical. See the full RTX 3090 to 5090 upgrade analysis for cost-benefit calculations.

For production API serving at scale, consider switching from Ollama to vLLM on the RTX 5090. Explore the full range of options in the tutorials section and calculate costs with the LLM cost calculator.
