
Ollama on RTX 3090: What Models Fit in 24GB?

Complete guide to which AI models run on the RTX 3090 with Ollama. Covers every model size from 7B to 70B, quantisation levels, and real-world performance in 24GB VRAM.

Ollama and the RTX 3090: What You Get

Ollama is the simplest way to run LLMs locally, and the RTX 3090 is the sweet spot for model variety. With 24GB GDDR6X VRAM on a dedicated GPU server, Ollama can load everything from 7B models in full precision to 70B models in aggressive quantisation. No other consumer GPU at this price point gives you access to so many model tiers.

This guide maps every popular model to its VRAM requirement and tells you exactly what fits. For a comparison of Ollama versus production-focused vLLM, see our vLLM vs Ollama guide.

Complete Model Compatibility Table

| Model | Quantisation | VRAM Used | Fits 24GB? | Ollama Tag |
|---|---|---|---|---|
| Llama 3 8B | Q4_K_M | ~5.5 GB | Yes | llama3:8b |
| Llama 3 8B | FP16 | ~16 GB | Yes | llama3:8b-fp16 |
| Mistral 7B | Q4_K_M | ~5 GB | Yes | mistral:7b |
| DeepSeek R1 7B | Q4_K_M | ~5 GB | Yes | deepseek-r1:7b |
| DeepSeek R1 14B | Q4_K_M | ~9 GB | Yes | deepseek-r1:14b |
| Qwen 2.5 14B | Q4_K_M | ~9.5 GB | Yes | qwen2.5:14b |
| CodeLlama 34B | Q4_K_M | ~20 GB | Yes (tight) | codellama:34b |
| Mixtral 8x7B | Q4_K_M | ~26 GB | No | mixtral:8x7b |
| Llama 3 70B | Q4_K_M | ~40 GB | No | llama3:70b |
| Llama 3 70B | Q2_K | ~26 GB | No (edge) | Custom GGUF |

The RTX 3090 comfortably handles models up to 34B in Q4 quantisation. Models above 34B push past 24GB and require multi-GPU clusters or the RTX 5090 with 32GB.
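The VRAM figures in the table follow a simple rule of thumb: parameter count times bits per weight, plus roughly 20% overhead for the KV cache and runtime buffers. Here is a minimal sketch of that estimate — the overhead factor and the ~4.5 effective bits for Q4_K_M are approximations, not exact Ollama internals:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight size plus ~20% for KV cache and buffers.

    params_b: parameter count in billions (e.g. 8 for Llama 3 8B)
    bits_per_weight: ~4.5 effective bits for Q4_K_M, 16 for FP16
    """
    weight_gb = params_b * bits_per_weight / 8  # billions of params * bits -> GB
    return round(weight_gb * overhead, 1)

# Sanity-check against the compatibility table
print(estimate_vram_gb(8, 4.5))    # Llama 3 8B, Q4_K_M
print(estimate_vram_gb(8, 16))     # Llama 3 8B, FP16
print(estimate_vram_gb(34, 4.5))   # CodeLlama 34B, Q4_K_M
```

The estimates land within a couple of GB of the measured values above — close enough to judge whether an unlisted model will fit before you download it.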

Install Ollama and Run Your First Model

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Llama 3 8B (default Q4_K_M)
ollama run llama3:8b

# Pull a larger model
ollama pull deepseek-r1:14b
ollama run deepseek-r1:14b

# FP16 for maximum quality (uses ~16GB)
ollama run llama3:8b-fp16

# List downloaded models
ollama list

# Show models currently loaded in VRAM
ollama ps

Ollama automatically detects CUDA and uses the RTX 3090 for inference. For a detailed server setup walkthrough, see the Ollama dedicated GPU setup guide.
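To confirm a model is actually resident on the GPU, Ollama's `GET /api/ps` endpoint reports per-model VRAM usage. A small sketch of parsing that response — the `size_vram` field (bytes) is what current Ollama versions return, but treat the exact shape as an assumption and check against your server:

```python
def vram_report(ps_json: dict) -> list[tuple[str, float]]:
    """Return (model name, VRAM in GB) for each loaded model.

    Assumes the shape of Ollama's GET /api/ps response:
    {"models": [{"name": ..., "size_vram": <bytes>}, ...]}
    """
    return [(m["name"], m.get("size_vram", 0) / 1e9)
            for m in ps_json.get("models", [])]

# Illustrative response (values made up to match the table above)
sample = {"models": [{"name": "llama3:8b", "size_vram": 5_500_000_000}]}
print(vram_report(sample))  # [('llama3:8b', 5.5)]
```

Against a live server, fetch the real payload with `curl http://localhost:11434/api/ps` and feed it to the same helper.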

Performance by Model Size

| Model | Quantisation | Tokens/s | Response Feel |
|---|---|---|---|
| Llama 3 8B | Q4_K_M | ~82 | Instant |
| Llama 3 8B | FP16 | ~48 | Fast |
| Mistral 7B | Q4_K_M | ~85 | Instant |
| DeepSeek R1 14B | Q4_K_M | ~42 | Fast |
| Qwen 2.5 14B | Q4_K_M | ~40 | Fast |
| CodeLlama 34B | Q4_K_M | ~18 | Usable |

Anything above 30 tokens/s feels real-time for interactive chat. The RTX 3090 keeps 7B-14B models well above that threshold. Check detailed benchmarks on the tokens-per-second benchmark tool.
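You can reproduce these numbers yourself: the final `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), so decode speed is just their ratio. A minimal sketch, with illustrative numbers rather than a live measurement:

```python
def tokens_per_second(resp: dict) -> float:
    """Decode speed from an Ollama /api/generate response.

    eval_count    = tokens generated
    eval_duration = generation time in nanoseconds
    Both fields appear in the final (non-streaming) response.
    """
    return resp["eval_count"] / resp["eval_duration"] * 1e9

# Illustrative figures in the range of Llama 3 8B Q4_K_M on the 3090
sample = {"eval_count": 410, "eval_duration": 5_000_000_000}  # 410 tokens in 5 s
print(round(tokens_per_second(sample)))  # 82
```

To get a real response, call the API with streaming disabled, e.g. `curl http://localhost:11434/api/generate -d '{"model": "llama3:8b", "prompt": "Hello", "stream": false}'`, and pass the JSON to the helper.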

Running Multiple Models

The 3090’s 24GB allows you to keep multiple smaller models loaded. Ollama swaps models in and out of VRAM automatically, but you can keep two models loaded simultaneously if they fit:

# Run two models in parallel (each ~5GB Q4_K_M)
# Terminal 1 — allow two models resident at once
OLLAMA_MAX_LOADED_MODELS=2 ollama serve &

# Terminal 2 — chat model
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Summarise this document..."
}'

# Terminal 3 — code model
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:7b",
  "prompt": "Write a Python function..."
}'

Two 7B Q4 models use roughly 11GB total, leaving 13GB free for KV cache and context. This is ideal for AI coding assistant setups that pair a general chat model with a specialised code model.
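Budgeting this by hand gets error-prone with more than two models, so here is a small sketch of the arithmetic — the 4 GB headroom figure is a working assumption for KV cache and context, not an Ollama-enforced limit:

```python
VRAM_GB = 24.0  # RTX 3090

def fits(models: dict[str, float], headroom_gb: float = 4.0) -> bool:
    """True if the given models (name -> approx size in GB) can co-reside
    in VRAM while leaving headroom for KV cache and context."""
    used = sum(models.values())
    return used + headroom_gb <= VRAM_GB

pair = {"llama3:8b": 5.5, "codellama:7b": 5.0}   # chat + code, ~10.5 GB
print(fits(pair))                                 # True

# A 34B model leaves no room for a second model
print(fits({"codellama:34b": 20.0, "llama3:8b": 5.5}))  # False
```

Plug in the VRAM figures from the compatibility table to check any combination before loading it.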

Limits and When to Upgrade

The RTX 3090 runs out of room for Mixtral 8x7B (26GB Q4), Llama 3 70B (40GB Q4), and any model requiring more than 24GB. If you need these models, the RTX 5090 with 32GB fits Mixtral and can run larger models in aggressive quantisation. For 70B models, multi-GPU clusters are the path forward.

If you prefer production-grade API serving over Ollama’s simplicity, see our vLLM on RTX 3090 guide. Compare self-hosting costs against API pricing with the LLM cost calculator and explore more guides in the tutorials section.

RTX 3090 Servers for Ollama

24GB VRAM, instant model loading, full root access. The ideal GPU for running every model tier with Ollama.

Browse GPU Servers
