
Ollama on RTX 3090: What Models Fit in 24GB?

Complete guide to which AI models run on the RTX 3090 with Ollama. Covers every model size from 7B to 70B, quantisation levels, and real-world performance in 24GB VRAM.

Ollama and the RTX 3090: What You Get

Ollama is the simplest way to run LLMs locally, and the RTX 3090 is the sweet spot for model variety. With 24GB GDDR6X VRAM on a dedicated GPU server, Ollama can load everything from 7B models in full precision to 70B models in aggressive quantisation. No other consumer GPU at this price point gives you access to so many model tiers.

This guide maps every popular model to its VRAM requirement and tells you exactly what fits. For a comparison of Ollama versus production-focused vLLM, see our vLLM vs Ollama guide.

Complete Model Compatibility Table

| Model | Quantisation | VRAM Used | Fits 24GB? | Ollama Tag |
|---|---|---|---|---|
| Llama 3 8B | Q4_K_M | ~5.5 GB | Yes | llama3:8b |
| Llama 3 8B | FP16 | ~16 GB | Yes | llama3:8b-fp16 |
| Mistral 7B | Q4_K_M | ~5 GB | Yes | mistral:7b |
| DeepSeek R1 7B | Q4_K_M | ~5 GB | Yes | deepseek-r1:7b |
| DeepSeek R1 14B | Q4_K_M | ~9 GB | Yes | deepseek-r1:14b |
| Qwen 2.5 14B | Q4_K_M | ~9.5 GB | Yes | qwen2.5:14b |
| CodeLlama 34B | Q4_K_M | ~20 GB | Yes (tight) | codellama:34b |
| Mixtral 8x7B | Q4_K_M | ~26 GB | No | mixtral:8x7b |
| Llama 3 70B | Q4_K_M | ~40 GB | No | llama3:70b |
| Llama 3 70B | Q2_K | ~26 GB | No (edge) | Custom GGUF |

The RTX 3090 comfortably handles models up to 34B in Q4 quantisation. Models above 34B push past 24GB and require multi-GPU clusters or the RTX 5090 with 32GB.
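The VRAM figures in the table follow a simple rule of thumb: parameter count times bits per weight, plus roughly 20% overhead for the KV cache and runtime buffers. Here is a minimal sketch of that estimate — the overhead factor and the ~4.5 effective bits for Q4_K_M are approximations, not exact Ollama internals:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight size plus ~20% for KV cache and buffers.

    params_b: parameter count in billions (e.g. 8 for Llama 3 8B)
    bits_per_weight: ~4.5 effective bits for Q4_K_M, 16 for FP16
    """
    weight_gb = params_b * bits_per_weight / 8  # billions of params * bits -> GB
    return round(weight_gb * overhead, 1)

# Sanity-check against the compatibility table
print(estimate_vram_gb(8, 4.5))    # Llama 3 8B, Q4_K_M
print(estimate_vram_gb(8, 16))     # Llama 3 8B, FP16
print(estimate_vram_gb(34, 4.5))   # CodeLlama 34B, Q4_K_M
```

The estimates land within a couple of GB of the measured values above — close enough to judge whether an unlisted model will fit before you download it.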

Install Ollama and Run Your First Model

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Llama 3 8B (default Q4_K_M)
ollama run llama3:8b

# Pull a larger model
ollama pull deepseek-r1:14b
ollama run deepseek-r1:14b

# FP16 for maximum quality (uses ~16GB)
ollama run llama3:8b-fp16

# List downloaded models
ollama list

# Show models currently loaded in VRAM
ollama ps

Ollama automatically detects CUDA and uses the RTX 3090 for inference. For a detailed server setup walkthrough, see the Ollama dedicated GPU setup guide.
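To confirm a model is actually resident on the GPU, Ollama's `GET /api/ps` endpoint reports per-model VRAM usage. A small sketch of parsing that response — the `size_vram` field (bytes) is what current Ollama versions return, but treat the exact shape as an assumption and check against your server:

```python
def vram_report(ps_json: dict) -> list[tuple[str, float]]:
    """Return (model name, VRAM in GB) for each loaded model.

    Assumes the shape of Ollama's GET /api/ps response:
    {"models": [{"name": ..., "size_vram": <bytes>}, ...]}
    """
    return [(m["name"], m.get("size_vram", 0) / 1e9)
            for m in ps_json.get("models", [])]

# Illustrative response (values made up to match the table above)
sample = {"models": [{"name": "llama3:8b", "size_vram": 5_500_000_000}]}
print(vram_report(sample))  # [('llama3:8b', 5.5)]
```

Against a live server, fetch the real payload with `curl http://localhost:11434/api/ps` and feed it to the same helper.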

Performance by Model Size

| Model | Quantisation | Tokens/s | Response Feel |
|---|---|---|---|
| Llama 3 8B | Q4_K_M | ~82 | Instant |
| Llama 3 8B | FP16 | ~48 | Fast |
| Mistral 7B | Q4_K_M | ~85 | Instant |
| DeepSeek R1 14B | Q4_K_M | ~42 | Fast |
| Qwen 2.5 14B | Q4_K_M | ~40 | Fast |
| CodeLlama 34B | Q4_K_M | ~18 | Usable |

Anything above 30 tokens/s feels real-time for interactive chat. The RTX 3090 keeps 7B-14B models well above that threshold. Check detailed benchmarks on the tokens-per-second benchmark tool.
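You can reproduce these numbers yourself: the final `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), so decode speed is just their ratio. A minimal sketch, with illustrative numbers rather than a live measurement:

```python
def tokens_per_second(resp: dict) -> float:
    """Decode speed from an Ollama /api/generate response.

    eval_count    = tokens generated
    eval_duration = generation time in nanoseconds
    Both fields appear in the final (non-streaming) response.
    """
    return resp["eval_count"] / resp["eval_duration"] * 1e9

# Illustrative figures in the range of Llama 3 8B Q4_K_M on the 3090
sample = {"eval_count": 410, "eval_duration": 5_000_000_000}  # 410 tokens in 5 s
print(round(tokens_per_second(sample)))  # 82
```

To get a real response, call the API with streaming disabled, e.g. `curl http://localhost:11434/api/generate -d '{"model": "llama3:8b", "prompt": "Hello", "stream": false}'`, and pass the JSON to the helper.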

Running Multiple Models

The 3090’s 24GB allows you to keep multiple smaller models loaded. Ollama swaps models in and out of VRAM automatically, but you can keep two models loaded simultaneously if they fit:

# Run two models in parallel (each ~5GB Q4_K_M)
# Terminal 1 — allow two models resident at once
OLLAMA_MAX_LOADED_MODELS=2 ollama serve &

# Terminal 2 — chat model
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Summarise this document..."
}'

# Terminal 3 — code model
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:7b",
  "prompt": "Write a Python function..."
}'

Two 7B Q4 models use roughly 11GB total, leaving 13GB free for KV cache and context. This is ideal for AI coding assistant setups that pair a general chat model with a specialised code model.
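Budgeting this by hand gets error-prone with more than two models, so here is a small sketch of the arithmetic — the 4 GB headroom figure is a working assumption for KV cache and context, not an Ollama-enforced limit:

```python
VRAM_GB = 24.0  # RTX 3090

def fits(models: dict[str, float], headroom_gb: float = 4.0) -> bool:
    """True if the given models (name -> approx size in GB) can co-reside
    in VRAM while leaving headroom for KV cache and context."""
    used = sum(models.values())
    return used + headroom_gb <= VRAM_GB

pair = {"llama3:8b": 5.5, "codellama:7b": 5.0}   # chat + code, ~10.5 GB
print(fits(pair))                                 # True

# A 34B model leaves no room for a second model
print(fits({"codellama:34b": 20.0, "llama3:8b": 5.5}))  # False
```

Plug in the VRAM figures from the compatibility table to check any combination before loading it.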

Limits and When to Upgrade

The RTX 3090 runs out of room for Mixtral 8x7B (26GB Q4), Llama 3 70B (40GB Q4), and any model requiring more than 24GB. If you need these models, the RTX 5090 with 32GB fits Mixtral and can run larger models in aggressive quantisation. For 70B models, multi-GPU clusters are the path forward.

If you prefer production-grade API serving over Ollama’s simplicity, see our vLLM on RTX 3090 guide. Compare self-hosting costs against API pricing with the LLM cost calculator and explore more guides in the tutorials section.

RTX 3090 Servers for Ollama

24GB VRAM, instant model loading, full root access. The ideal GPU for running every model tier with Ollama.

Browse GPU Servers
