llama.cpp’s HTTP server is the common path for GGUF models on dedicated GPU hosting, especially with CPU-GPU mixed inference. Its thread flags are more nuanced than most guides let on. Here is what each one actually does.
Thread Flags
- -t/--threads: CPU threads used during generation
- -tb/--threads-batch: CPU threads used during prompt processing
- -tt/--threads-thinking: additional threads in newer builds
- --parallel: number of parallel inference slots (concurrent requests)
- -ngl/--n-gpu-layers: layers to offload to GPU
GPU-Only
When the entire model is on GPU (-ngl 999), CPU threads do little work during inference. Leave -t at a moderate value (4-8) and focus on --parallel.
llama-server -m model.gguf \
-ngl 999 \
-t 4 \
--parallel 8 \
--host 0.0.0.0 --port 8080
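One thing to keep in mind when raising --parallel: llama.cpp's server divides the context window evenly across slots, so each concurrent request sees only a fraction of -c. A minimal sketch of that arithmetic (the context size below is an illustrative assumption, not from the command above):

```python
# llama.cpp's server splits the context evenly across parallel slots:
# each slot gets roughly n_ctx // n_parallel tokens.
def ctx_per_slot(n_ctx: int, n_parallel: int) -> int:
    """Tokens of context each concurrent request can use."""
    return n_ctx // n_parallel

# With an assumed -c 8192 and --parallel 8, each request sees a
# 1024-token window, so raise -c if clients send long prompts
# under high concurrency.
print(ctx_per_slot(8192, 8))  # 1024
```

In practice this means --parallel trades per-request context for throughput, so size -c to the longest prompt you expect times the slot count.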
Hybrid CPU-GPU
When layers live on both CPU and GPU (-ngl 30 for half the layers on a 60-layer model), CPU threads matter. Set -t to the number of physical cores minus one or two, leaving headroom for the server itself and the OS. -tb can go higher for prompt processing because that workload is batched and scales across more cores.
llama-server -m model.gguf \
-ngl 30 \
-t 14 \
-tb 28 \
--parallel 4
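Picking the -ngl value itself comes down to VRAM budgeting: how many layers fit after reserving space for the KV cache and runtime buffers. A rough sketch, where every number is an illustrative assumption (per-layer size varies by model and quant; check actual usage with nvidia-smi):

```python
# Rough VRAM budgeting to choose -ngl for hybrid CPU-GPU inference.
# All sizes here are assumptions for illustration: a 60-layer model
# at ~0.55 GiB per layer, with overhead reserved for the KV cache,
# CUDA buffers, and the output layer.
def layers_that_fit(vram_gib: float, layer_gib: float, overhead_gib: float) -> int:
    """How many whole layers fit in the VRAM left after overhead."""
    usable = vram_gib - overhead_gib
    return max(0, int(usable // layer_gib))

n_layers = 60  # assumed total layer count
fit = layers_that_fit(vram_gib=24.0, layer_gib=0.55, overhead_gib=4.0)
print(min(fit, n_layers))  # candidate value for -ngl
```

Start a few layers below the computed value and step up until you hit out-of-memory errors, since overhead grows with context size and slot count.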
Recipes
| GPU | Model | Recommended Config |
|---|---|---|
| RTX 3050 6GB | Phi-3-mini Q4 | -ngl 999, -t 4, --parallel 2 |
| RTX 4060 Ti 16GB | Llama 3 8B Q5 | -ngl 999, -t 6, --parallel 4 |
| RTX 5090 | Llama 3 70B IQ3_XS (partial) | -ngl 55, -t 14, -tb 28, --parallel 2 |
| RTX 6000 Pro | Llama 3 70B Q4_K_M | -ngl 999, -t 8, --parallel 6 |
llama.cpp Tuned for GGUF Hosting
We benchmark and set thread counts on UK dedicated servers for your exact GGUF model.
Browse GPU Servers. See n-gpu-layers tuning and llama.cpp GPU GGUF.