
Ollama on RTX 5090: Running Large Models in 32GB

The RTX 5090's 32GB GDDR7 unlocks Mixtral 8x7B, 34B models in high quality, and dual-model setups in Ollama. Full compatibility guide with performance benchmarks.

What 32GB Unlocks for Ollama

The RTX 5090 pushes Ollama into a new tier. With 32GB GDDR7 at 1,792 GB/s bandwidth on a dedicated GPU server, you can run models that simply do not fit on 24GB cards. Mixtral 8x7B in Q4, CodeLlama 34B with comfortable context windows, 14B models in FP16, and multi-model configurations that previously required two GPUs all become possible on a single card.

Combined with Blackwell architecture’s native FP4 support, the 5090 also makes smaller models dramatically faster. For the serving-engine comparison, see vLLM vs Ollama.

Large Model Compatibility Table

| Model | Quantisation | VRAM Used | RTX 3090 (24GB) | RTX 5090 (32GB) |
|---|---|---|---|---|
| Llama 3 8B | FP16 | ~16 GB | Yes | Yes |
| Llama 3 13B | FP16 | ~26 GB | No | Yes |
| DeepSeek R1 14B | FP16 | ~28 GB | No | Yes |
| Qwen 2.5 14B | FP16 | ~28 GB | No | Yes |
| CodeLlama 34B | Q4_K_M | ~20 GB | Tight | Yes (12GB free) |
| Mixtral 8x7B | Q4_K_M | ~26 GB | No | Yes |
| Llama 3 70B | Q4_K_M | ~40 GB | No | No |
| Llama 3 70B | Q2_K | ~26 GB | No | Yes (low quality) |

The key unlocks are 13B-14B FP16 (no quantisation loss), Mixtral 8x7B (the most capable open mixture-of-experts model), and 34B models with enough headroom for long context. The 70B tier remains out of reach for comfortable single-GPU use. For 70B, see multi-GPU cluster hosting.
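The fit-or-not column follows from simple arithmetic: weight memory is roughly parameter count times bytes per weight, plus overhead for the KV cache and CUDA context. A minimal Python sketch of that rule of thumb; the per-weight byte counts and the fixed 2 GB overhead are approximations, not exact Ollama allocations:

```python
# Approximate average storage per weight for common formats.
# Q4_K_M averages ~4.65 bits/weight, Q2_K ~2.6 bits/weight (approximations).
BYTES_PER_WEIGHT = {
    "fp16": 2.0,
    "q4_k_m": 4.65 / 8,
    "q2_k": 2.6 / 8,
}

def fits(params_b: float, quant: str, vram_gb: float = 32.0,
         overhead_gb: float = 2.0) -> bool:
    """True if the weights plus a fixed KV-cache/context overhead fit in VRAM."""
    weights_gb = params_b * BYTES_PER_WEIGHT[quant]
    return weights_gb + overhead_gb <= vram_gb

print(fits(14, "fp16"))        # 14B FP16 ~28 GB -> fits in 32 GB
print(fits(14, "fp16", 24.0))  # -> does not fit on a 24 GB card
print(fits(46.7, "q4_k_m"))    # Mixtral 8x7B (~46.7B total params) -> fits
```

The same arithmetic reproduces the 70B row: 70B at Q4_K_M needs ~41 GB of weights alone, well past 32 GB, while Q2_K squeezes it under the limit at a steep quality cost.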

Setup and Configuration

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Mixtral 8x7B — now fits in a single GPU
ollama run mixtral:8x7b

# 13B FP16 for maximum quality
ollama run llama3:13b

# DeepSeek R1 14B full precision
ollama run deepseek-r1:14b-fp16

# Expose API
OLLAMA_HOST=0.0.0.0 ollama serve

On the 5090, Ollama needs no special configuration: token generation automatically benefits from the GDDR7 bandwidth. The only requirement is a driver stack with CUDA 12.8 or newer, which Blackwell cards demand. See the CUDA installation guide for details.
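Once the API is exposed, any HTTP client can drive it. Here is a minimal sketch using only the Python standard library against Ollama's `/api/generate` endpoint in non-streaming mode; the model tag and prompt are just examples:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_generate_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a single completion request and return the generated text."""
    payload = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Requires a running Ollama server with the model pulled
    print(generate("mixtral:8x7b", "Explain the KV cache in one sentence."))
```

With `stream` set to `True` (the API default) the server instead returns one JSON object per token, which is what you want for interactive chat UIs.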

Performance Benchmarks

| Model | Quantisation | Tokens/s (RTX 5090) | Tokens/s (RTX 3090) |
|---|---|---|---|
| Llama 3 8B | Q4_K_M | ~155 | ~82 |
| Llama 3 8B | FP16 | ~105 | ~48 |
| Llama 3 13B | FP16 | ~62 | OOM |
| DeepSeek R1 14B | FP16 | ~55 | OOM |
| Mixtral 8x7B | Q4_K_M | ~38 | OOM |
| CodeLlama 34B | Q4_K_M | ~32 | ~18 |

The 5090’s bandwidth advantage means even models that fit on the 3090 run nearly twice as fast. At ~38 tokens/s, Mixtral is fast enough for real-time chat while offering substantially stronger reasoning than single 8B models. Check more numbers in the tokens-per-second benchmark.
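The near-2x gap tracks memory bandwidth because single-request decoding is bandwidth bound: every generated token must stream the full weight set from VRAM, so tokens/s cannot exceed bandwidth divided by weight size. A back-of-envelope sketch (weight size is the approximate table figure; for Mixtral the bound is looser, since a mixture-of-experts model only reads its active experts per token):

```python
def decode_ceiling(bandwidth_gbs: float, weights_gb: float) -> float:
    """Upper bound on tokens/s when decoding is memory-bandwidth bound:
    each generated token streams the full weight set from VRAM."""
    return bandwidth_gbs / weights_gb

# Bandwidths: RTX 5090 ~1792 GB/s, RTX 3090 ~936 GB/s
for gpu, bw in [("RTX 5090", 1792), ("RTX 3090", 936)]:
    ceiling = decode_ceiling(bw, 5.5)  # Llama 3 8B Q4_K_M, ~5.5 GB weights
    print(f"{gpu}: <= {ceiling:.0f} tok/s theoretical")
# Measured ~155 and ~82 tok/s -- roughly half the ceiling on both cards,
# which is typical once KV-cache reads and kernel overhead are included.
```

That both cards land at a similar fraction of their ceiling is what you expect when bandwidth, not compute, is the bottleneck.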

Dual-Model and Multi-Model Setups

With 32GB, the 5090 can keep two models loaded simultaneously in Ollama:

# Chat model + Code model simultaneously
# Both stay in VRAM — no swapping delay
ollama pull llama3:8b          # ~5.5 GB Q4
ollama pull codellama:13b      # ~8.5 GB Q4
# Total: ~14 GB, leaving 18 GB for KV cache

# Or a chat model + embedding model for RAG
ollama pull llama3:8b          # ~5.5 GB
ollama pull nomic-embed-text   # ~0.5 GB
# Total: ~6 GB, 26 GB free for context

This enables chatbot and AI search pipelines on a single GPU without model-switching latency.
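By default Ollama may evict one model to make room for another; to keep both resident, raise the loaded-model limit via the `OLLAMA_MAX_LOADED_MODELS` environment variable when starting the server. A sketch using the models from the example above (a generate request with no prompt loads a model without producing output):

```shell
# Allow two models to stay resident in VRAM simultaneously
export OLLAMA_MAX_LOADED_MODELS=2
OLLAMA_HOST=0.0.0.0 ollama serve &

# An empty generate request loads each model and keeps it warm
curl http://localhost:11434/api/generate -d '{"model": "llama3:8b"}'
curl http://localhost:11434/api/generate -d '{"model": "codellama:13b"}'
```

After this, requests to either model are served from VRAM with no load latency between them.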

RTX 5090 vs RTX 3090 for Ollama

Choose the RTX 5090 for Ollama when you need 13B-14B models without quantisation, Mixtral 8x7B, multi-model serving, or maximum speed for 7B-8B models. The RTX 3090 remains the value choice when 24GB VRAM covers your model needs and speed is less critical. See the full RTX 3090 to 5090 upgrade analysis for cost-benefit calculations.

For production API serving at scale, consider switching from Ollama to vLLM on the RTX 5090. Explore the full range of options in the tutorials section and calculate costs with the LLM cost calculator.
