
llama.cpp Server Thread Tuning for Dedicated GPUs

llama.cpp's server exposes five flags that govern threading and work placement, and they interact in non-obvious ways. Getting them right can double throughput on some dedicated-GPU configurations.

llama.cpp’s HTTP server is the common path for GGUF models on dedicated GPU hosting, especially with CPU-GPU mixed inference. Its thread flags are more nuanced than most guides let on. Here is what each one actually does.


Thread Flags

  • -t / --threads: CPU threads used during generation
  • -tb / --threads-batch: CPU threads used during prompt processing
  • --threads-http: Threads used by the HTTP server to handle incoming requests (does not affect inference compute)
  • --parallel: Number of parallel inference slots (concurrent requests)
  • -ngl / --n-gpu-layers: Layers to offload to GPU

GPU-Only

When the entire model is on GPU (-ngl 999), CPU threads do little work during inference. Leave -t at a moderate value (4-8) and focus on --parallel.

llama-server -m model.gguf \
  -ngl 999 \
  -t 4 \
  --parallel 8 \
  --host 0.0.0.0 --port 8080
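Once the server is up, it is worth confirming that concurrent requests actually succeed rather than queue into timeouts. Here is a sketch of a concurrency smoke test; it assumes the server above is reachable (the URL and the `smoke_test` helper name are ours) and uses llama-server's /completion endpoint:

```shell
#!/usr/bin/env bash
# Fire N concurrent completion requests at a llama-server instance and
# count how many return HTTP 200. URL and request count are assumptions;
# match the count to your --parallel setting.
smoke_test() {
  local url=$1 n=$2 ok=0
  local pids=() codes
  codes=$(mktemp)
  for i in $(seq 1 "$n"); do
    curl -s -o /dev/null -w '%{http_code}\n' \
      -X POST "$url/completion" \
      -H 'Content-Type: application/json' \
      -d '{"prompt":"Hello","n_predict":8}' >>"$codes" &
    pids+=($!)
  done
  wait "${pids[@]}"
  ok=$(grep -c '^200$' "$codes")
  rm -f "$codes"
  echo "$ok/$n requests returned 200"
}
# Example: smoke_test http://localhost:8080 8
```

If fewer requests succeed than slots you configured, check server logs for slot exhaustion before touching thread counts.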

Hybrid CPU-GPU

When layers live on both CPU and GPU (-ngl 30 for half the layers on a 60-layer model), CPU threads matter. Set -t to the number of physical cores minus one or two. -tb can go higher for prompt processing because that workload is more parallel.

llama-server -m model.gguf \
  -ngl 30 \
  -t 14 \
  -tb 28 \
  --parallel 4
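Rather than hard-coding -t, the physical core count can be derived at launch time. A minimal sketch, assuming a Linux host with lscpu (util-linux) installed; the variable names are ours:

```shell
#!/usr/bin/env bash
# Derive -t from the physical core count (hyperthreads rarely help
# token generation). Counts unique Core,Socket pairs reported by lscpu.
PHYS_CORES=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l)
# Fall back to nproc (logical CPUs) if lscpu is unavailable.
[ "$PHYS_CORES" -gt 0 ] 2>/dev/null || PHYS_CORES=$(nproc)
# Leave two cores free for the OS and the HTTP server.
GEN_THREADS=$(( PHYS_CORES > 2 ? PHYS_CORES - 2 : 1 ))
echo "Physical cores: $PHYS_CORES, suggested -t: $GEN_THREADS"
```

You can then launch with -t "$GEN_THREADS" and something like -tb "$(nproc)", since prompt processing tends to scale across logical threads as noted above.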

Recipes

GPU              | Model                        | Recommended Config
RTX 3050 6GB     | Phi-3-mini Q4                | -ngl 999, -t 4, --parallel 2
RTX 4060 Ti 16GB | Llama 3 8B Q5                | -ngl 999, -t 6, --parallel 4
RTX 5090         | Llama 3 70B IQ3_XS (partial) | -ngl 55, -t 14, -tb 28, --parallel 2
RTX 6000 Pro     | Llama 3 70B Q4_K_M           | -ngl 999, -t 8, --parallel 6
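The -ngl values above come from benchmarking, but you can get a starting point from model file size and VRAM alone. This is a rough heuristic of ours, not anything shipped with llama.cpp: it assumes roughly uniform layer sizes and reserves about 1.5 GB for KV cache and compute buffers, so verify the result against nvidia-smi.

```shell
#!/usr/bin/env bash
# Rough -ngl estimate: layers that fit = (VRAM - overhead) / avg layer size.
# All numbers are assumptions; treat the output as a starting point.
estimate_ngl() {
  local model_mb=$1 total_layers=$2 vram_mb=$3
  local overhead_mb=1536                      # assumed KV cache + buffers
  local layer_mb=$(( model_mb / total_layers ))
  local fit=$(( (vram_mb - overhead_mb) / layer_mb ))
  if [ "$fit" -ge "$total_layers" ]; then
    echo 999          # whole model fits: offload everything
  elif [ "$fit" -lt 1 ]; then
    echo 0            # nothing fits alongside the overhead
  else
    echo "$fit"
  fi
}
# Example: ~28000 MB file, 80 layers, 24 GB card
estimate_ngl 28000 80 24576   # prints 65 with these assumed numbers
```

Round down a few layers from the estimate if you see out-of-memory errors at longer contexts, since KV cache growth eats into the reserved headroom.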

llama.cpp Tuned for GGUF Hosting

We benchmark and set thread counts on UK dedicated servers for your exact GGUF model.

Browse GPU Servers

Related guides: n-gpu-layers tuning and llama.cpp GPU GGUF.
