Ollama’s defaults are tuned for a single-user developer laptop. On a dedicated GPU server hosting a real workload, two environment variables need adjusting: OLLAMA_NUM_PARALLEL and OLLAMA_MAX_QUEUE. Set them wrong and you either starve throughput or get OOM crashes.
## OLLAMA_NUM_PARALLEL
Default: 4 in recent versions (Ollama auto-selects 4 or 1 depending on available memory). Controls how many requests a single loaded model can serve concurrently. Each concurrent request needs its own KV-cache slot, so raising this value costs VRAM roughly in proportion to the context length.
For an RTX 4060 Ti with 16 GB serving Llama 3 8B at Q5 quantization, 4 is often right. For an RTX 6000 Pro with 96 GB serving the same model, 32 or higher is fine.
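To see why each slot costs VRAM, the per-slot KV cache can be estimated from the model's architecture. A rough sketch, assuming Llama 3 8B's published shape (32 layers, 8 KV heads, head dimension 128) and an fp16 cache at an 8K context:

```python
# Per-slot KV-cache size: 2 (K and V) x layers x kv_heads x head_dim
# x context_length x bytes per element. The figures below assume
# Llama 3 8B's published architecture; adjust for your model.
def kv_cache_bytes(layers, kv_heads, head_dim, ctx_len, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * ctx_len * dtype_bytes

per_slot = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, ctx_len=8192)
print(per_slot / 2**30)  # ~1.0 GiB per slot at fp16, 8K context
```

At OLLAMA_NUM_PARALLEL=32 that is on the order of 32 GiB of cache on top of the weights, which is why the 96 GB card is comfortable with a setting that would sink a 16 GB card.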
## OLLAMA_MAX_QUEUE
Default: 512. Maximum number of requests that can wait in the queue before Ollama returns HTTP 503. For a public API, keep this high to avoid rejecting legitimate bursts. For internal tools, lower values can serve as a back-pressure signal to clients.
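Clients should treat that 503 as a retryable back-pressure signal rather than a hard failure. A minimal sketch of the client side, with the HTTP call injected as a callable so the retry policy is independent of any particular client library (the function names here are illustrative, not part of Ollama's API):

```python
import random
import time

def call_with_backoff(send, max_retries=5, base_delay=0.5):
    """Retry a request while the server sheds load with HTTP 503.

    `send` is any zero-argument callable returning (status, body),
    e.g. a wrapper around a POST to /api/generate.
    """
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 503:
            return status, body
        # Exponential backoff with jitter so retries don't arrive in lockstep.
        time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random() / 2))
    return status, body
```

Dropping OLLAMA_MAX_QUEUE and pairing it with a client policy like this pushes waiting out of the server's queue and into the clients, which is usually the right trade for internal tools.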
## Setting Them Together
| Scenario | num_parallel | max_queue |
|---|---|---|
| 8 GB card, single user | 1 | 32 |
| 16 GB card, small team | 4 | 128 |
| 24-32 GB card, SaaS | 8-16 | 512 |
| 96 GB card, high concurrency | 32-64 | 2048 |
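The table collapses into a simple starting-point heuristic. A sketch (the thresholds mirror the table above and are rules of thumb, not Ollama defaults):

```python
def suggest_settings(vram_gb):
    """Return (num_parallel, max_queue) starting points by VRAM size."""
    if vram_gb <= 8:
        return 1, 32
    if vram_gb <= 16:
        return 4, 128
    if vram_gb <= 32:
        return 8, 512    # low end of the table's 8-16 range
    return 32, 2048      # high-concurrency cards, e.g. 96 GB
```

Treat the output as a first guess and tune from observed VRAM headroom and queue depth, not as a fixed answer.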
## Setting on the Server
Edit the systemd unit:

```shell
sudo systemctl edit ollama
```

Add:

```ini
[Service]
Environment="OLLAMA_NUM_PARALLEL=16"
Environment="OLLAMA_MAX_QUEUE=1024"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_HOST=0.0.0.0:11434"
```

Then reload and restart:

```shell
sudo systemctl daemon-reload && sudo systemctl restart ollama
```
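After the restart, it is worth confirming that concurrent requests actually complete rather than pile up. A smoke-test sketch: fire several requests at once and check the statuses. In real use, point `BASE` at your Ollama host and use a model you have pulled (`llama3:8b` below is an assumption); here a local stub server stands in so the sketch is self-contained.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen

class StubHandler(BaseHTTPRequestHandler):
    """Stand-in for the Ollama server; replace BASE with the real host."""
    def do_POST(self):
        body = json.dumps({"response": "pong"}).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):
        pass  # keep the smoke test quiet

server = ThreadingHTTPServer(("127.0.0.1", 0), StubHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
BASE = f"http://127.0.0.1:{server.server_address[1]}"  # real: http://host:11434

def generate(prompt):
    # POST to /api/generate with stream=False and return the HTTP status.
    payload = {"model": "llama3:8b", "prompt": prompt, "stream": False}
    req = Request(f"{BASE}/api/generate", data=json.dumps(payload).encode())
    with urlopen(req) as resp:
        return resp.status

with ThreadPoolExecutor(max_workers=8) as pool:
    statuses = list(pool.map(generate, [f"ping {i}" for i in range(8)]))
print(statuses)
server.shutdown()
```

Against a real server with OLLAMA_NUM_PARALLEL=16, all eight should run simultaneously; if some hang while others finish, the new settings have not taken effect.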