By default, Ollama unloads a model from VRAM five minutes after the last request. For a dedicated API endpoint on our hosting this causes intermittent cold-start latency of 10-30 seconds. A single environment variable fixes it.
Default Behavior
Ollama tracks idle time per loaded model. After 5 minutes with no requests, it unloads the model and frees its VRAM. The next request triggers a reload: weights are copied from disk back into VRAM, which takes seconds to tens of seconds depending on model size and storage speed.
Setting Keep-Alive
Set OLLAMA_KEEP_ALIVE in the systemd unit or shell environment:
Environment="OLLAMA_KEEP_ALIVE=24h"
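For a systemd-managed install, a drop-in override keeps the setting separate from the packaged unit file. A minimal sketch, assuming the standard `ollama.service` unit created by the Linux installer (the drop-in filename `keep-alive.conf` is arbitrary):

```shell
# Create a drop-in override for the ollama service
# (path assumes the standard systemd unit name "ollama").
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/keep-alive.conf <<'EOF'
[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"
EOF

# Reload unit definitions and restart so the variable takes effect.
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

`sudo systemctl edit ollama` achieves the same result interactively.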
Valid values:
- `-1` (or any negative value): never unload; the model stays loaded until the service restarts
- `0`: unload immediately after each response
- `5m`: the default
- `30m`, `2h`, `24h`: keep loaded for the specified duration
For a single-model API endpoint, set it to -1. VRAM use is predictable because one model always occupies its slot.
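To confirm the model stays resident, you can ask the running server which models are loaded and when they expire. A quick check, assuming a live Ollama instance on the default `localhost:11434` (output requires the server to be running):

```shell
# List loaded models with their eviction deadline; with keep_alive -1
# the UNTIL column reads "Forever".
ollama ps

# The same information over HTTP: each loaded model reports an
# "expires_at" timestamp.
curl -s http://localhost:11434/api/ps
```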
When Unloading Is Desired
If you serve multiple models from one GPU whose VRAM is too tight to hold all of them simultaneously, a keep-alive of 5-15 minutes lets Ollama juggle them. A request for model A loads A (possibly unloading B to make room); a later request for B loads B again. The cost is swap latency on each switch; the benefit is that you can serve more models than the card can physically hold at once.
Per-Request Override
The API accepts a keep_alive field per request:
curl http://localhost:11434/api/generate \
-d '{"model":"llama3","prompt":"Hello","keep_alive":"1h"}'
Setting "keep_alive": 0 forces an unload immediately after the response, which is useful for one-off heavy models in a batch workflow.
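A sketch of that batch pattern, assuming the default `localhost:11434` endpoint and a pulled `llama3` model (adjust both for your setup):

```shell
# Hypothetical batch step: run a one-off heavy model and evict it
# immediately so the next model in the batch loads into free VRAM.
cat > payload.json <<'EOF'
{
  "model": "llama3",
  "prompt": "Summarise the attached report in three bullet points.",
  "stream": false,
  "keep_alive": 0
}
EOF

# keep_alive: 0 tells the server to free the model's VRAM as soon as
# this response is returned.
curl -s http://localhost:11434/api/generate -d @payload.json \
  || echo "request failed (is Ollama running?)"
```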
Recommended Settings

| Scenario | OLLAMA_KEEP_ALIVE |
|---|---|
| Single-model production API | -1 |
| Multi-model, tight VRAM | 10m |
| Dev/test environment | 5m (default) |
| Batch processing, one-off models | 0 or per-request |
Pre-Tuned Ollama Hosting
We configure keep-alive and parallel settings on UK dedicated servers for your workload.
Browse GPU Servers. See also num_parallel and max_queue, and multi-model memory management.