
Ollama num_parallel and num_queue Tuning

Two Ollama environment variables control how many requests run in parallel and how many can wait in the queue. The defaults suit a single user; under moderate multi-user traffic they can lead to rejected requests or out-of-memory crashes.

Ollama’s defaults are tuned for a single-user developer laptop. On a dedicated GPU server hosting a real workload, two environment variables need adjusting: OLLAMA_NUM_PARALLEL and OLLAMA_MAX_QUEUE. Set them wrong and you either starve throughput or get OOM crashes.


OLLAMA_NUM_PARALLEL

Default: auto-selected in recent versions (Ollama picks 4 or 1 depending on available memory). Controls how many requests run concurrently against a single loaded model. Each parallel slot gets its own context and KV-cache allocation, so raising this value costs VRAM.

For an RTX 4060 Ti 16GB serving Llama 3 8B at Q5, 4 is often right. For an RTX 6000 Pro 96GB serving the same model, 32 or higher is fine.
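A rough back-of-envelope for the VRAM cost per slot: with fp16 K/V values, a slot's KV cache is approximately layers × 2 (K and V) × kv_heads × head_dim × context length × bytes per value. A minimal sketch for Llama 3 8B at its full 8192-token context (the shape numbers and the fp16 assumption are ours, not read from Ollama):

```shell
# Illustrative KV-cache cost per parallel slot for Llama 3 8B:
# 32 layers, 8 KV heads (GQA), head_dim 128, 8192-token context, fp16 (2 bytes)
LAYERS=32; KV_HEADS=8; HEAD_DIM=128; CTX=8192; BYTES=2
PER_SLOT=$(( LAYERS * 2 * KV_HEADS * HEAD_DIM * CTX * BYTES ))
PER_SLOT_GIB=$(( PER_SLOT / 1024 / 1024 / 1024 ))
echo "$PER_SLOT_GIB GiB per slot"   # prints "1 GiB per slot"
```

So four slots at full context add roughly 4 GiB on top of the model weights; shorter per-request contexts or a quantized KV cache shrink this.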

OLLAMA_MAX_QUEUE

Default: 512. Maximum number of requests that can wait in the queue before Ollama returns HTTP 503. For a public API, keep this high to avoid rejecting legitimate bursts. For internal tools, lower values can serve as a back-pressure signal to clients.
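A deep queue also implies long worst-case waits: the request at the back of a full queue waits roughly max_queue × average generation time ÷ num_parallel. A quick sketch with the defaults (the 3-second average is an assumed illustrative number):

```shell
# Worst-case wait for the last queued request, at the default settings
MAX_QUEUE=512; NUM_PARALLEL=4
AVG_SECONDS=3                      # assumed average time per generation
WORST=$(( MAX_QUEUE * AVG_SECONDS / NUM_PARALLEL ))
echo "$WORST s"                    # prints "384 s"
```

That is over six minutes; for interactive internal tools, a fast 503 is often kinder than the wait, which is why a lower queue depth can work as back-pressure.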

Setting Them Together

| Scenario | num_parallel | max_queue |
| --- | --- | --- |
| 8 GB card, single user | 1 | 32 |
| 16 GB card, small team | 4 | 128 |
| 24-32 GB card, SaaS | 8-16 | 512 |
| 96 GB card, high concurrency | 32-64 | 2048 |
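If you script server provisioning, the table collapses into a small helper. This is a hypothetical convenience (`suggest` is our name, not an Ollama tool) that hard-codes the same thresholds:

```shell
# Map VRAM (GiB) to the table's suggested settings (hypothetical helper)
suggest() {
  vram=$1
  if   [ "$vram" -le 8 ];  then echo "OLLAMA_NUM_PARALLEL=1 OLLAMA_MAX_QUEUE=32"
  elif [ "$vram" -le 16 ]; then echo "OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_QUEUE=128"
  elif [ "$vram" -le 32 ]; then echo "OLLAMA_NUM_PARALLEL=8 OLLAMA_MAX_QUEUE=512"
  else                          echo "OLLAMA_NUM_PARALLEL=32 OLLAMA_MAX_QUEUE=2048"
  fi
}
suggest 16   # prints "OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_QUEUE=128"
```

Treat the output as a starting point and verify actual VRAM headroom under load before committing.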

Setting on the Server

Edit the systemd unit:

sudo systemctl edit ollama

Add:

[Service]
Environment="OLLAMA_NUM_PARALLEL=16"
Environment="OLLAMA_MAX_QUEUE=1024"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_HOST=0.0.0.0:11434"

Then reload and restart:

sudo systemctl daemon-reload && sudo systemctl restart ollama


See keep-alive memory tuning and multi-model memory.


