
llama.cpp Server Thread Tuning for Dedicated GPUs

llama.cpp's server exposes five flags that govern threading and work placement, and they interact in non-obvious ways. Getting them right can double throughput on some dedicated-GPU configurations.

llama.cpp’s HTTP server is the common path for GGUF models on dedicated GPU hosting, especially with CPU-GPU mixed inference. Its thread flags are more nuanced than most guides let on. Here is what each one actually does.


Thread Flags

  • -t / --threads: CPU threads used during generation
  • -tb / --threads-batch: CPU threads used during prompt processing
  • --threads-http: Threads used by the HTTP server to handle incoming requests (does not affect inference compute)
  • --parallel: Number of parallel inference slots (concurrent requests)
  • -ngl / --n-gpu-layers: Layers to offload to GPU

GPU-Only

When the entire model is on GPU (-ngl 999), CPU threads do little work during inference. Leave -t at a moderate value (4-8) and focus on --parallel.

llama-server -m model.gguf \
  -ngl 999 \
  -t 4 \
  --parallel 8 \
  --host 0.0.0.0 --port 8080
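Once the server is up, it is worth confirming that concurrent requests actually succeed rather than queue into timeouts. Here is a sketch of a concurrency smoke test; it assumes the server above is reachable (the URL and the `smoke_test` helper name are ours) and uses llama-server's /completion endpoint:

```shell
#!/usr/bin/env bash
# Fire N concurrent completion requests at a llama-server instance and
# count how many return HTTP 200. URL and request count are assumptions;
# match the count to your --parallel setting.
smoke_test() {
  local url=$1 n=$2 ok=0
  local pids=() codes
  codes=$(mktemp)
  for i in $(seq 1 "$n"); do
    curl -s -o /dev/null -w '%{http_code}\n' \
      -X POST "$url/completion" \
      -H 'Content-Type: application/json' \
      -d '{"prompt":"Hello","n_predict":8}' >>"$codes" &
    pids+=($!)
  done
  wait "${pids[@]}"
  ok=$(grep -c '^200$' "$codes")
  rm -f "$codes"
  echo "$ok/$n requests returned 200"
}
# Example: smoke_test http://localhost:8080 8
```

If fewer requests succeed than slots you configured, check server logs for slot exhaustion before touching thread counts.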

Hybrid CPU-GPU

When layers live on both CPU and GPU (-ngl 30 for half the layers on a 60-layer model), CPU threads matter. Set -t to the number of physical cores minus one or two. -tb can go higher for prompt processing because that workload is more parallel.

llama-server -m model.gguf \
  -ngl 30 \
  -t 14 \
  -tb 28 \
  --parallel 4
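Rather than hard-coding -t, the physical core count can be derived at launch time. A minimal sketch, assuming a Linux host with lscpu (util-linux) installed; the variable names are ours:

```shell
#!/usr/bin/env bash
# Derive -t from the physical core count (hyperthreads rarely help
# token generation). Counts unique Core,Socket pairs reported by lscpu.
PHYS_CORES=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l)
# Fall back to nproc (logical CPUs) if lscpu is unavailable.
[ "$PHYS_CORES" -gt 0 ] 2>/dev/null || PHYS_CORES=$(nproc)
# Leave two cores free for the OS and the HTTP server.
GEN_THREADS=$(( PHYS_CORES > 2 ? PHYS_CORES - 2 : 1 ))
echo "Physical cores: $PHYS_CORES, suggested -t: $GEN_THREADS"
```

You can then launch with -t "$GEN_THREADS" and something like -tb "$(nproc)", since prompt processing tends to scale across logical threads as noted above.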

Recipes

GPU              | Model                        | Recommended Config
RTX 3050 6GB     | Phi-3-mini Q4                | -ngl 999, -t 4, --parallel 2
RTX 4060 Ti 16GB | Llama 3 8B Q5                | -ngl 999, -t 6, --parallel 4
RTX 5090         | Llama 3 70B IQ3_XS (partial) | -ngl 55, -t 14, -tb 28, --parallel 2
RTX 6000 Pro     | Llama 3 70B Q4_K_M           | -ngl 999, -t 8, --parallel 6
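The -ngl values above come from benchmarking, but you can get a starting point from model file size and VRAM alone. This is a rough heuristic of ours, not anything shipped with llama.cpp: it assumes roughly uniform layer sizes and reserves about 1.5 GB for KV cache and compute buffers, so verify the result against nvidia-smi.

```shell
#!/usr/bin/env bash
# Rough -ngl estimate: layers that fit = (VRAM - overhead) / avg layer size.
# All numbers are assumptions; treat the output as a starting point.
estimate_ngl() {
  local model_mb=$1 total_layers=$2 vram_mb=$3
  local overhead_mb=1536                      # assumed KV cache + buffers
  local layer_mb=$(( model_mb / total_layers ))
  local fit=$(( (vram_mb - overhead_mb) / layer_mb ))
  if [ "$fit" -ge "$total_layers" ]; then
    echo 999          # whole model fits: offload everything
  elif [ "$fit" -lt 1 ]; then
    echo 0            # nothing fits alongside the overhead
  else
    echo "$fit"
  fi
}
# Example: ~28000 MB file, 80 layers, 24 GB card
estimate_ngl 28000 80 24576   # prints 65 with these assumed numbers
```

Round down a few layers from the estimate if you see out-of-memory errors at longer contexts, since KV cache growth eats into the reserved headroom.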

llama.cpp Tuned for GGUF Hosting

We benchmark and set thread counts on UK dedicated servers for your exact GGUF model.

Browse GPU Servers

Related guides: n-gpu-layers tuning and llama.cpp GPU GGUF.
