Ollama and the RTX 3090: What You Get
Ollama is the simplest way to run LLMs locally, and the RTX 3090 hits a sweet spot for model variety. With 24GB of GDDR6X VRAM on a dedicated GPU server, Ollama can load everything from 7B models at full FP16 precision up to 34B models at Q4 quantisation; even aggressively quantised 70B builds spill just past 24GB, as the table below shows. No other consumer GPU at this price point gives you access to so many model tiers.
This guide maps every popular model to its VRAM requirement and tells you exactly what fits. For a comparison of Ollama versus production-focused vLLM, see our vLLM vs Ollama guide.
Complete Model Compatibility Table
| Model | Quantisation | VRAM Used | Fits 24GB? | Ollama Tag |
|---|---|---|---|---|
| Llama 3 8B | Q4_K_M | ~5.5 GB | Yes | llama3:8b |
| Llama 3 8B | FP16 | ~16 GB | Yes | llama3:8b-fp16 |
| Mistral 7B | Q4_K_M | ~5 GB | Yes | mistral:7b |
| DeepSeek R1 7B | Q4_K_M | ~5 GB | Yes | deepseek-r1:7b |
| DeepSeek R1 14B | Q4_K_M | ~9 GB | Yes | deepseek-r1:14b |
| Qwen 2.5 14B | Q4_K_M | ~9.5 GB | Yes | qwen2.5:14b |
| CodeLlama 34B | Q4_K_M | ~20 GB | Yes (tight) | codellama:34b |
| Mixtral 8x7B | Q4_K_M | ~26 GB | No | mixtral:8x7b |
| Llama 3 70B | Q4_K_M | ~40 GB | No | llama3:70b |
| Llama 3 70B | Q2_K | ~26 GB | No (edge) | Custom GGUF |
The RTX 3090 comfortably handles models up to 34B in Q4 quantisation. Models above 34B push past 24GB and require multi-GPU clusters or the RTX 5090 with 32GB.
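You can approximate the table's VRAM figures yourself with a rule of thumb: a GGUF model's weights need roughly parameters (in billions) × bits-per-weight ÷ 8 gigabytes, plus some fixed overhead for buffers, and the KV cache grows on top of that with context length. A minimal sketch, assuming Q4_K_M averages about 4.85 bits per weight; the function name and the ~1GB overhead constant are illustrative, not Ollama internals:

```shell
# Rough GGUF weight footprint in GB: params (billions) * bits-per-weight / 8, plus ~1 GB
# of overhead. estimate_vram and the +1 constant are illustrative assumptions.
estimate_vram() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 + 1 }'
}

estimate_vram 8 4.85    # Llama 3 8B at ~4.85 bpw (Q4_K_M), close to the ~5.5 GB in the table
estimate_vram 34 4.85   # CodeLlama 34B at the same quant
```

Add your expected context length's KV cache on top before deciding whether a model truly fits in 24GB.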
Install Ollama and Run Your First Model
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Llama 3 8B (default Q4_K_M)
ollama run llama3:8b

# Pull a larger model
ollama pull deepseek-r1:14b
ollama run deepseek-r1:14b

# FP16 for maximum quality (uses ~16GB)
ollama run llama3:8b-fp16

# List installed models
ollama list

# Show models currently loaded in VRAM
ollama ps
```
Ollama automatically detects CUDA and uses the RTX 3090 for inference. For a detailed server setup walkthrough, see the Ollama dedicated GPU setup guide.
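If you want to confirm a model actually landed on the GPU rather than falling back to CPU, two quick checks help; both assume the Ollama server and NVIDIA drivers are running, so the output is environment-dependent:

```shell
# Show loaded models and where they run; the PROCESSOR column should read "100% GPU"
ollama ps

# Cross-check VRAM usage at the driver level
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```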
Performance by Model Size
| Model | Quantisation | Tokens/s | Response Feel |
|---|---|---|---|
| Llama 3 8B | Q4_K_M | ~82 | Instant |
| Llama 3 8B | FP16 | ~48 | Fast |
| Mistral 7B | Q4_K_M | ~85 | Instant |
| DeepSeek R1 14B | Q4_K_M | ~42 | Fast |
| Qwen 2.5 14B | Q4_K_M | ~40 | Fast |
| CodeLlama 34B | Q4_K_M | ~18 | Usable |
Anything above 30 tokens/s feels real-time for interactive chat, and the RTX 3090 keeps 7B-14B models well above that threshold. For detailed numbers, see the tokens-per-second benchmark tool.
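You can reproduce these figures on your own hardware: `ollama run <model> --verbose` prints timing statistics after each response, including an `eval rate` line in tokens/s. A small sketch that extracts the number from that line; the sample line is illustrative output, not a measurement:

```shell
# Extract tokens/s from the "eval rate" line that `ollama run <model> --verbose`
# prints after each reply. The sample below stands in for real verbose output.
sample='eval rate:            82.31 tokens/s'
rate=$(printf '%s\n' "$sample" | awk '/eval rate/ { print $(NF-1) }')
echo "$rate"
```

Piping real `--verbose` output through the same `awk` filter lets you log throughput across models for your own comparison table.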
Running Multiple Models
The 3090’s 24GB lets you keep more than one smaller model resident. By default Ollama swaps models in and out of VRAM as requests arrive, but it can hold several loaded simultaneously if they fit (the `OLLAMA_MAX_LOADED_MODELS` environment variable controls the limit):
```shell
# Run two models in parallel (each ~5GB at Q4_K_M)

# Terminal 1 — start the server
ollama serve &

# Terminal 2 — chat model
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Summarise this document..."
}'

# Terminal 3 — code model
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:7b",
  "prompt": "Write a Python function..."
}'
```
An 8B and a 7B model at Q4 use roughly 11GB combined, leaving about 13GB free for KV cache and context. This is ideal for AI coding assistant setups that pair a general chat model with a specialised code model.
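To stop Ollama from evicting an idle model between requests, the API accepts a `keep_alive` duration per request, and the server honours `OLLAMA_MAX_LOADED_MODELS`. A sketch of both knobs; the values are illustrative:

```shell
# Allow two models to stay resident at once (set before starting `ollama serve`)
export OLLAMA_MAX_LOADED_MODELS=2

# Pin a model in VRAM for an hour after its last request (per-request, client-side)
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:7b",
  "prompt": "Write a Python function...",
  "keep_alive": "1h"
}'
```

Setting `keep_alive` to `-1` keeps a model loaded indefinitely, which suits an always-on coding assistant.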
Limits and When to Upgrade
The RTX 3090 runs out of room for Mixtral 8x7B (26GB Q4), Llama 3 70B (40GB Q4), and any model requiring more than 24GB. If you need these models, the RTX 5090 with 32GB fits Mixtral and can run larger models in aggressive quantisation. For 70B models, multi-GPU clusters are the path forward.
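If you still want to experiment with a quant that Ollama doesn't publish under a tag (such as the 70B Q2_K row in the table above), you can import a GGUF file by hand via a Modelfile. The filename and model name here are placeholders, and as the table notes, this particular build still exceeds 24GB:

```shell
# Import a custom GGUF quant (the path is a placeholder for your downloaded file)
cat > Modelfile <<'EOF'
FROM ./llama3-70b.Q2_K.gguf
EOF

ollama create llama3-70b-q2 -f Modelfile
ollama run llama3-70b-q2
```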
If you prefer production-grade API serving over Ollama’s simplicity, see our vLLM on RTX 3090 guide. Compare self-hosting costs against API pricing with the LLM cost calculator and explore more guides in the tutorials section.
RTX 3090 Servers for Ollama
24GB VRAM, instant model loading, full root access. The ideal GPU for running every model tier with Ollama.
Browse GPU Servers