
Ollama Keep-Alive and Model Memory Tuning

Ollama unloads models from VRAM after idle. Adjust keep_alive to avoid cold-start latency or to share a GPU between models without manual reload.

By default, Ollama unloads a model from VRAM five minutes after the last request. For a dedicated API endpoint on our hosting this causes intermittent cold-start latency of 10-30 seconds. A single environment variable fixes it.

Default Behavior

Ollama tracks idle time per loaded model. After 5 minutes with no requests, it unloads the model (the same operation ollama stop <model> performs) and frees its VRAM. The next request triggers a reload: weights are copied from disk to VRAM, which takes seconds to tens of seconds depending on model size and storage speed.

Setting Keep-Alive

Set OLLAMA_KEEP_ALIVE in the systemd unit or shell environment:

Environment="OLLAMA_KEEP_ALIVE=24h"
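On a systemd-managed install, the cleanest way to set this is a drop-in override rather than editing the unit file directly. A minimal sketch, assuming the standard Linux install where the service is named ollama:

```shell
# Open a drop-in override for the ollama service
sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=24h"

# Apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Confirm the variable reached the service
systemctl show ollama --property=Environment
```

The drop-in survives package upgrades, which overwrite the main unit file.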

Valid values:

  • -1 (or any negative duration): never unload; keep loaded until service restart
  • 0: unload immediately after each response
  • 5m: the default
  • 30m, 2h, 24h: keep loaded for the specified idle duration

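The duration strings follow Go's duration format: a number followed by a unit such as s, m, or h, while a bare number is treated as seconds. A small Python sketch of how the values resolve; parse_keep_alive is a hypothetical helper for illustration, not part of Ollama:

```python
import re

def parse_keep_alive(value):
    """Resolve an OLLAMA_KEEP_ALIVE string to seconds.

    Returns -1 for "keep loaded forever", 0 for "unload immediately",
    or a positive number of seconds for a timed keep-alive.
    """
    units = {"s": 1, "m": 60, "h": 3600}
    # Bare numbers: any negative value means forever, else plain seconds
    if re.fullmatch(r"-?\d+", value):
        n = int(value)
        return -1 if n < 0 else n
    # Duration strings: sum each number+unit pair, e.g. "1h30m"
    total = 0
    for amount, unit in re.findall(r"(\d+)([smh])", value):
        total += int(amount) * units[unit]
    return total

print(parse_keep_alive("24h"))  # 86400
print(parse_keep_alive("5m"))   # 300
print(parse_keep_alive("-1"))   # -1
```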
For a single-model API endpoint, set it to -1. VRAM use is predictable because one model always occupies its slot.

When Unloading Is Desired

If you serve multiple models from one GPU whose VRAM is too tight to hold all of them simultaneously, a keep-alive of 5-15 minutes lets Ollama swap models on demand. A request for model A loads A (possibly unloading B); a later request for B loads B again. The cost is reload latency on each swap; the benefit is that you can serve more models than the card can hold at once.
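The swap behavior can be pictured as a simple eviction model. The sketch below is a toy illustration of the idea, not Ollama's actual scheduler, and the model sizes are made up:

```python
from collections import OrderedDict

class GpuModelCache:
    """Toy model of a GPU holding at most `capacity_gb` of model weights.

    Loading a model evicts the least-recently-used models until it fits.
    """
    def __init__(self, capacity_gb):
        self.capacity = capacity_gb
        self.loaded = OrderedDict()  # model name -> size in GB

    def request(self, model, size_gb):
        if model in self.loaded:
            self.loaded.move_to_end(model)   # warm hit: no reload cost
            return "warm"
        while sum(self.loaded.values()) + size_gb > self.capacity:
            self.loaded.popitem(last=False)  # evict least-recently-used
        self.loaded[model] = size_gb
        return "cold"                        # paid the reload latency

gpu = GpuModelCache(capacity_gb=24)
print(gpu.request("llama3:8b", 6))    # cold
print(gpu.request("mistral:7b", 5))   # cold
print(gpu.request("llama3:8b", 6))    # warm
print(gpu.request("llama3:70b", 20))  # cold (evicts both smaller models)
print(list(gpu.loaded))               # ['llama3:70b']
```

Keep-alive adds a time dimension on top of this: a model also leaves the cache when it sits idle past the configured duration.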

Per-Request Override

The API accepts a keep_alive field per request:

curl http://localhost:11434/api/generate \
  -d '{"model":"llama3","prompt":"Hello","keep_alive":"1h"}'

Setting "keep_alive": 0 forces an unload immediately after the response, which is useful for one-off heavy models in a batch workflow.
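In code, the override is just one extra field in the request body. A minimal Python sketch using only the standard library; the endpoint and field names match the curl example above, and stream is set to false so the response arrives as a single JSON object. The send step assumes an Ollama server on localhost:11434:

```python
import json
import urllib.request

def generate_request(model, prompt, keep_alive=None):
    """Build the JSON body for POST /api/generate, with an optional
    per-request keep_alive override ("1h", "30m", 0, -1, ...)."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if keep_alive is not None:
        body["keep_alive"] = keep_alive
    return body

def send(body, host="http://localhost:11434"):
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Unload immediately after this one-off response
body = generate_request("llama3", "Hello", keep_alive=0)
print(json.dumps(body))
# send(body)  # requires a running Ollama server
```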

Scenario                            OLLAMA_KEEP_ALIVE
Single-model production API         -1
Multi-model, tight VRAM             10m
Dev/test environment                5m (default)
Batch processing, one-off models    0 or per-request

Pre-Tuned Ollama Hosting

We configure keep-alive and parallel settings on UK dedicated servers for your workload.

Browse GPU Servers

See also: OLLAMA_NUM_PARALLEL and OLLAMA_MAX_QUEUE, and multi-model memory management.


