
Ollama Keep-Alive and Model Memory Tuning

Ollama unloads models from VRAM after idle. Adjust keep_alive to avoid cold-start latency or to share a GPU between models without manual reload.

By default, Ollama unloads a model from VRAM five minutes after the last request. For a dedicated API endpoint on our hosting this causes intermittent cold-start latency of 10-30 seconds. A single environment variable fixes it.

Default Behavior

Ollama tracks idle time per loaded model. After 5 minutes with no requests, it unloads the model (the same operation ollama stop <model> performs) and frees its VRAM. The next request triggers a reload: weights are copied from disk to VRAM, which takes seconds to tens of seconds depending on model size and storage speed.

Setting Keep-Alive

Set OLLAMA_KEEP_ALIVE in the systemd unit or shell environment:

Environment="OLLAMA_KEEP_ALIVE=24h"
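On a systemd-managed install, the cleanest way to set this is a drop-in override rather than editing the unit file directly. A minimal sketch, assuming the standard Linux install where the service is named ollama:

```shell
# Open a drop-in override for the ollama service
sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=24h"

# Apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Confirm the variable reached the service
systemctl show ollama --property=Environment
```

The drop-in survives package upgrades, which overwrite the main unit file.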

Valid values:

  • -1 (or any negative duration): never unload; keep loaded until service restart
  • 0: unload immediately after each response
  • 5m: the default
  • 30m, 2h, 24h: keep loaded for the specified idle duration

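The duration strings follow Go's duration format: a number followed by a unit such as s, m, or h, while a bare number is treated as seconds. A small Python sketch of how the values resolve; parse_keep_alive is a hypothetical helper for illustration, not part of Ollama:

```python
import re

def parse_keep_alive(value):
    """Resolve an OLLAMA_KEEP_ALIVE string to seconds.

    Returns -1 for "keep loaded forever", 0 for "unload immediately",
    or a positive number of seconds for a timed keep-alive.
    """
    units = {"s": 1, "m": 60, "h": 3600}
    # Bare numbers: any negative value means forever, else plain seconds
    if re.fullmatch(r"-?\d+", value):
        n = int(value)
        return -1 if n < 0 else n
    # Duration strings: sum each number+unit pair, e.g. "1h30m"
    total = 0
    for amount, unit in re.findall(r"(\d+)([smh])", value):
        total += int(amount) * units[unit]
    return total

print(parse_keep_alive("24h"))  # 86400
print(parse_keep_alive("5m"))   # 300
print(parse_keep_alive("-1"))   # -1
```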
For a single-model API endpoint, set it to -1. VRAM use is predictable because one model always occupies its slot.

When Unloading Is Desired

If you serve multiple models from one GPU whose VRAM is too tight to hold all of them simultaneously, a keep-alive of 5-15 minutes lets Ollama swap models on demand. A request for model A loads A (possibly unloading B); a later request for B loads B again. The cost is reload latency on each swap; the benefit is that you can serve more models than the card can hold at once.
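The swap behavior can be pictured as a simple eviction model. The sketch below is a toy illustration of the idea, not Ollama's actual scheduler, and the model sizes are made up:

```python
from collections import OrderedDict

class GpuModelCache:
    """Toy model of a GPU holding at most `capacity_gb` of model weights.

    Loading a model evicts the least-recently-used models until it fits.
    """
    def __init__(self, capacity_gb):
        self.capacity = capacity_gb
        self.loaded = OrderedDict()  # model name -> size in GB

    def request(self, model, size_gb):
        if model in self.loaded:
            self.loaded.move_to_end(model)   # warm hit: no reload cost
            return "warm"
        while sum(self.loaded.values()) + size_gb > self.capacity:
            self.loaded.popitem(last=False)  # evict least-recently-used
        self.loaded[model] = size_gb
        return "cold"                        # paid the reload latency

gpu = GpuModelCache(capacity_gb=24)
print(gpu.request("llama3:8b", 6))    # cold
print(gpu.request("mistral:7b", 5))   # cold
print(gpu.request("llama3:8b", 6))    # warm
print(gpu.request("llama3:70b", 20))  # cold (evicts both smaller models)
print(list(gpu.loaded))               # ['llama3:70b']
```

Keep-alive adds a time dimension on top of this: a model also leaves the cache when it sits idle past the configured duration.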

Per-Request Override

The API accepts a keep_alive field per request:

curl http://localhost:11434/api/generate \
  -d '{"model":"llama3","prompt":"Hello","keep_alive":"1h"}'

Setting "keep_alive": 0 forces an unload immediately after the response, which is useful for one-off heavy models in a batch workflow.
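In code, the override is just one extra field in the request body. A minimal Python sketch using only the standard library; the endpoint and field names match the curl example above, and stream is set to false so the response arrives as a single JSON object. The send step assumes an Ollama server on localhost:11434:

```python
import json
import urllib.request

def generate_request(model, prompt, keep_alive=None):
    """Build the JSON body for POST /api/generate, with an optional
    per-request keep_alive override ("1h", "30m", 0, -1, ...)."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if keep_alive is not None:
        body["keep_alive"] = keep_alive
    return body

def send(body, host="http://localhost:11434"):
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Unload immediately after this one-off response
body = generate_request("llama3", "Hello", keep_alive=0)
print(json.dumps(body))
# send(body)  # requires a running Ollama server
```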

Scenario                            OLLAMA_KEEP_ALIVE
Single-model production API         -1
Multi-model, tight VRAM             10m
Dev/test environment                5m (default)
Batch processing, one-off models    0 or per-request

Pre-Tuned Ollama Hosting

We configure keep-alive and parallel settings on UK dedicated servers for your workload.

Browse GPU Servers

See also: OLLAMA_NUM_PARALLEL and OLLAMA_MAX_QUEUE, and multi-model memory management.


