Ollama’s defaults are tuned for a single-user developer laptop. On a dedicated GPU server hosting a real workload, two environment variables need adjusting: OLLAMA_NUM_PARALLEL and OLLAMA_MAX_QUEUE. Set them wrong and you either starve throughput or get OOM crashes.
## OLLAMA_NUM_PARALLEL
Default: 4 in recent versions (Ollama auto-selects 4 or 1 depending on available memory). Controls how many requests a single loaded model can serve concurrently. Each concurrent request needs its own KV-cache slot, so raising this value costs VRAM roughly in proportion to the context length.
For an RTX 4060 Ti with 16 GB serving Llama 3 8B at Q5 quantization, 4 is often right. For an RTX 6000 Pro with 96 GB serving the same model, 32 or higher is fine.
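To see why each slot costs VRAM, the per-slot KV cache can be estimated from the model's architecture. A rough sketch, assuming Llama 3 8B's published shape (32 layers, 8 KV heads, head dimension 128) and an fp16 cache at an 8K context:

```python
# Per-slot KV-cache size: 2 (K and V) x layers x kv_heads x head_dim
# x context_length x bytes per element. The figures below assume
# Llama 3 8B's published architecture; adjust for your model.
def kv_cache_bytes(layers, kv_heads, head_dim, ctx_len, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * ctx_len * dtype_bytes

per_slot = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, ctx_len=8192)
print(per_slot / 2**30)  # ~1.0 GiB per slot at fp16, 8K context
```

At OLLAMA_NUM_PARALLEL=32 that is on the order of 32 GiB of cache on top of the weights, which is why the 96 GB card is comfortable with a setting that would sink a 16 GB card.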
## OLLAMA_MAX_QUEUE
Default: 512. Maximum number of requests that can wait in the queue before Ollama returns HTTP 503. For a public API, keep this high to avoid rejecting legitimate bursts. For internal tools, lower values can serve as a back-pressure signal to clients.
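Clients should treat that 503 as a retryable back-pressure signal rather than a hard failure. A minimal sketch of the client side, with the HTTP call injected as a callable so the retry policy is independent of any particular client library (the function names here are illustrative, not part of Ollama's API):

```python
import random
import time

def call_with_backoff(send, max_retries=5, base_delay=0.5):
    """Retry a request while the server sheds load with HTTP 503.

    `send` is any zero-argument callable returning (status, body),
    e.g. a wrapper around a POST to /api/generate.
    """
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 503:
            return status, body
        # Exponential backoff with jitter so retries don't arrive in lockstep.
        time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random() / 2))
    return status, body
```

Dropping OLLAMA_MAX_QUEUE and pairing it with a client policy like this pushes waiting out of the server's queue and into the clients, which is usually the right trade for internal tools.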
## Setting Them Together
| Scenario | num_parallel | max_queue |
|---|---|---|
| 8 GB card, single user | 1 | 32 |
| 16 GB card, small team | 4 | 128 |
| 24-32 GB card, SaaS | 8-16 | 512 |
| 96 GB card, high concurrency | 32-64 | 2048 |
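The table collapses into a simple starting-point heuristic. A sketch (the thresholds mirror the table above and are rules of thumb, not Ollama defaults):

```python
def suggest_settings(vram_gb):
    """Return (num_parallel, max_queue) starting points by VRAM size."""
    if vram_gb <= 8:
        return 1, 32
    if vram_gb <= 16:
        return 4, 128
    if vram_gb <= 32:
        return 8, 512    # low end of the table's 8-16 range
    return 32, 2048      # high-concurrency cards, e.g. 96 GB
```

Treat the output as a first guess and tune from observed VRAM headroom and queue depth, not as a fixed answer.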
## Setting on the Server
Edit the systemd unit:

```shell
sudo systemctl edit ollama
```

Add:

```ini
[Service]
Environment="OLLAMA_NUM_PARALLEL=16"
Environment="OLLAMA_MAX_QUEUE=1024"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_HOST=0.0.0.0:11434"
```

Then reload and restart:

```shell
sudo systemctl daemon-reload && sudo systemctl restart ollama
```
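After the restart, it is worth confirming that concurrent requests actually complete rather than pile up. A smoke-test sketch: fire several requests at once and check the statuses. In real use, point `BASE` at your Ollama host and use a model you have pulled (`llama3:8b` below is an assumption); here a local stub server stands in so the sketch is self-contained.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen

class StubHandler(BaseHTTPRequestHandler):
    """Stand-in for the Ollama server; replace BASE with the real host."""
    def do_POST(self):
        body = json.dumps({"response": "pong"}).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):
        pass  # keep the smoke test quiet

server = ThreadingHTTPServer(("127.0.0.1", 0), StubHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
BASE = f"http://127.0.0.1:{server.server_address[1]}"  # real: http://host:11434

def generate(prompt):
    # POST to /api/generate with stream=False and return the HTTP status.
    payload = {"model": "llama3:8b", "prompt": prompt, "stream": False}
    req = Request(f"{BASE}/api/generate", data=json.dumps(payload).encode())
    with urlopen(req) as resp:
        return resp.status

with ThreadPoolExecutor(max_workers=8) as pool:
    statuses = list(pool.map(generate, [f"ping {i}" for i in range(8)]))
print(statuses)
server.shutdown()
```

Against a real server with OLLAMA_NUM_PARALLEL=16, all eight should run simultaneously; if some hang while others finish, the new settings have not taken effect.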