
Ollama Multi-Model Memory Management

Manage multiple models in Ollama on a single GPU server. Covers VRAM allocation strategies, model swapping, concurrent loading limits, memory reclamation, and multi-GPU model pinning.

Running Multiple Models on One Server

Production deployments often need more than one model: a large model for complex reasoning, a small model for classification, and perhaps an embedding model for search. On a single GPU server, fitting all of these into VRAM simultaneously requires deliberate memory management. Without it, Ollama loads and unloads models unpredictably, causing latency spikes every time a cold model needs reloading.

How Ollama Handles Model Loading

By default, Ollama keeps the most recently used model in VRAM and unloads it when a different model is requested (if VRAM is insufficient for both):

# Check which models are currently loaded
curl -s http://localhost:11434/api/ps | python3 -m json.tool

# Sample output showing one loaded model
{
  "models": [{
    "name": "llama3.1:8b",
    "size": 4915200000,
    "size_vram": 4915200000,
    "expires_at": "2025-01-15T10:30:00Z"
  }]
}

The expires_at field shows when Ollama will unload the model from memory. By default, models stay loaded for 5 minutes after the last request.
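If you want the remaining time as a number rather than a timestamp, the /api/ps response is easy to post-process. A minimal sketch, assuming the default localhost:11434 endpoint; the function name is ours, not part of the Ollama API:

```python
import json
from datetime import datetime, timezone

def seconds_until_unload(ps_json, now=None):
    """Given a parsed /api/ps response, return {model: seconds until unload}."""
    if now is None:
        now = datetime.now(timezone.utc)
    out = {}
    for m in ps_json.get("models", []):
        # expires_at is RFC 3339; normalise the trailing 'Z' for fromisoformat
        expires = datetime.fromisoformat(m["expires_at"].replace("Z", "+00:00"))
        out[m["name"]] = max(0.0, (expires - now).total_seconds())
    return out

# Feed it the live endpoint:
# import urllib.request
# with urllib.request.urlopen("http://localhost:11434/api/ps") as r:
#     print(seconds_until_unload(json.load(r)))
```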

Configure Concurrent Model Limits

Control how many models Ollama keeps in VRAM simultaneously:

# Allow up to 3 models loaded at once
OLLAMA_MAX_LOADED_MODELS=3 ollama serve

# Combined with parallel request handling
OLLAMA_MAX_LOADED_MODELS=2 OLLAMA_NUM_PARALLEL=4 ollama serve

# For systemd deployments, set the variables in an override file
sudo systemctl edit ollama

# Add to /etc/systemd/system/ollama.service.d/override.conf:
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_NUM_PARALLEL=2"

# Then apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama

Set OLLAMA_MAX_LOADED_MODELS based on your total VRAM divided by the typical per-model footprint (weights plus KV cache). An RTX 6000 Pro 96 GB can hold two 8B Q4 models (~5 GB of weights each, matching the size_vram figure above) with plenty of room left for KV caches.
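That sizing rule can be sketched as a small helper. The KV-cache and headroom figures below are assumptions; tune them for your context length and OLLAMA_NUM_PARALLEL setting:

```python
def max_loaded_models(total_vram_gb, model_sizes_gb,
                      kv_cache_gb_per_model=2.0, headroom_gb=2.0):
    """Rough heuristic: how many of the given models fit in VRAM at once.

    kv_cache_gb_per_model and headroom_gb are assumed defaults, not
    values Ollama reports -- adjust for your workload.
    """
    budget = total_vram_gb - headroom_gb
    fit = 0
    for size in sorted(model_sizes_gb):  # pack smallest first
        need = size + kv_cache_gb_per_model
        if budget < need:
            break
        budget -= need
        fit += 1
    return fit

# Two 8B Q4 models plus a small embedding model on a 96 GB card:
print(max_loaded_models(96, [4.9, 4.9, 0.6]))  # -> 3
```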

Control Model Keep-Alive Duration

Adjust how long idle models remain in VRAM before automatic unloading:

# Keep model loaded for 30 minutes after last request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Hello",
  "keep_alive": "30m"
}'

# Keep model loaded indefinitely (never auto-unload)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Hello",
  "keep_alive": -1
}'

# Immediately unload a specific model
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "keep_alive": 0
}'

Pin your most-used model with keep_alive: -1 and let less frequent models load on demand.
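A startup script can pre-warm and pin those models before the first real request arrives. A minimal sketch using only the standard library; the host and model names are placeholders for your own:

```python
import json
import urllib.request

def pin_payload(model, keep_alive=-1):
    """Request body that loads `model` and pins it.

    No prompt is included, so Ollama loads the model without
    generating any tokens.
    """
    return {"model": model, "keep_alive": keep_alive}

def pin_model(model, host="http://localhost:11434"):
    """POST the pin request to a running Ollama instance."""
    body = json.dumps(pin_payload(model)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Pre-warm the hot models at service startup (example names):
# for name in ["llama3.1:8b", "nomic-embed-text"]:
#     pin_model(name)
```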

Pin Models to Specific GPUs

On multi-GPU servers, dedicate each GPU to a different model for zero-swap-time serving:

# Terminal 1: Ollama instance for the large model on GPUs 0 and 1
CUDA_VISIBLE_DEVICES=0,1 OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Terminal 2: Separate Ollama instance for small model on GPU 2
CUDA_VISIBLE_DEVICES=2 OLLAMA_HOST=0.0.0.0:11435 ollama serve

Your application routes requests to the appropriate port based on which model it needs. This eliminates model swapping entirely at the cost of dedicating GPU resources.
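The routing logic can be as simple as a static model-to-port map. A sketch whose ports mirror the two instances above; the model names are placeholders:

```python
# Map each model to the Ollama instance that owns it.
# Ports match the two-instance setup above; model names are examples.
MODEL_PORTS = {
    "llama3.1:70b": 11434,  # large model, GPUs 0-1
    "llama3.2:3b": 11435,   # small model, GPU 2
}

def endpoint_for(model, default_port=11434):
    """Return the generate endpoint for the instance serving `model`."""
    port = MODEL_PORTS.get(model, default_port)
    return f"http://localhost:{port}/api/generate"

print(endpoint_for("llama3.2:3b"))  # -> http://localhost:11435/api/generate
```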

Memory Monitoring Strategy

# Monitor VRAM usage across all GPUs
watch -n 2 nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv

# Script to track model loading events
while true; do
  curl -s http://localhost:11434/api/ps | python3 -c "
import sys, json
data = json.load(sys.stdin)
for m in data.get('models', []):
    vram_gb = m['size_vram'] / 1e9
    print(f\"{m['name']}: {vram_gb:.1f} GB VRAM, expires {m['expires_at']}\")
"
  sleep 10
done

Track loading frequency to identify models that should be pinned versus those that can load on demand. For workloads requiring dozens of concurrent models, vLLM's PagedAttention memory manager provides more granular control than Ollama's whole-model loading.
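One way to measure loading frequency is to diff successive /api/ps snapshots. A sketch of the counting logic, taking pre-collected sets of loaded model names as input:

```python
from collections import Counter

def count_load_events(snapshots):
    """Count how often each model transitions from unloaded to loaded.

    `snapshots` is a sequence of sets of loaded model names, sampled at
    regular intervals (e.g. from /api/ps every 10 seconds).
    """
    loads = Counter()
    previous = set()
    for current in snapshots:
        for model in current - previous:  # newly loaded since last sample
            loads[model] += 1
        previous = current
    return loads

# A model that keeps reloading is a pinning candidate:
print(count_load_events([{"a"}, set(), {"a"}, {"a", "b"}]))
# -> Counter({'a': 2, 'b': 1})
```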

Multi-GPU Servers for Multi-Model Deployments

GigaGPU offers 2x, 4x, and 8x GPU configurations with up to 640 GB total VRAM for running multiple models simultaneously.
