
llama.cpp n-gpu-layers Tuning for Mixed Inference

-ngl controls how many transformer layers live on the GPU. Picking the right number balances speed against VRAM - with rules that depend on the model.

Among llama.cpp’s many flags, -ngl (or --n-gpu-layers) is the most important for GPU-accelerated GGUF inference on dedicated GPU hosting. It controls how many layers live on the GPU. The right value depends on model, VRAM, and what else is competing for the card.


What It Does

llama.cpp loads a GGUF model and decides per-layer whether it lives in CPU RAM or GPU VRAM. -ngl N puts the first N layers on the GPU. The remainder run on CPU. Most models have 32 (7B), 40 (13B), 48 (22B), 60 (32B), or 80 (70B) layers.
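Those typical layer counts can be sketched as a quick lookup. This is a hypothetical helper (the name `layers_for_size` and the shell wrapper are not part of llama.cpp); the counts come from the model sizes above:

```shell
#!/usr/bin/env bash
# layers_for_size: rough transformer layer count by model size class.
# Hypothetical helper; values match common Llama-family architectures.
layers_for_size() {
  case "$1" in
    7B|8B)  echo 32 ;;
    13B)    echo 40 ;;
    22B)    echo 48 ;;
    32B)    echo 60 ;;
    70B)    echo 80 ;;
    *)      echo "unknown size: $1" >&2; return 1 ;;
  esac
}

layers_for_size 70B   # prints 80
```

For the exact count of a specific GGUF, check the model's metadata (llama.cpp prints the layer count when loading) rather than relying on size-class rules of thumb.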

-ngl 999 is the idiomatic “put everything possible on GPU” value – llama.cpp clamps to the model’s actual layer count.

Layers Per GB

Rough VRAM per layer for common quantisations:

Model         Quant    ~VRAM per layer
Llama 3 8B    Q5_K_M   ~150 MB
Llama 3 8B    Q4_K_M   ~120 MB
Llama 3 70B   Q4_K_M   ~500 MB
Llama 3 70B   IQ3_XS   ~380 MB

Additional VRAM goes to KV cache, CUDA overhead, and the embedding/head layers. Reserve 15-20% headroom beyond your layer estimate.
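That estimate is easy to script. A minimal sketch, assuming a flat 20% headroom multiplier on top of the per-layer cost (the helper name `estimate_vram_mb` is hypothetical):

```shell
#!/usr/bin/env bash
# Rough VRAM estimate: layers x per-layer cost, plus ~20% headroom
# for KV cache, CUDA overhead, and the embedding/head layers.
# Usage: estimate_vram_mb LAYERS MB_PER_LAYER   (hypothetical helper)
estimate_vram_mb() {
  local layers=$1 mb_per_layer=$2
  # integer maths: multiply by 120/100 to add 20% headroom
  echo $(( layers * mb_per_layer * 120 / 100 ))
}

# Llama 3 8B Q5_K_M: 32 layers x ~150 MB
estimate_vram_mb 32 150   # prints 5760 (MB) -> fits comfortably in 8 GB
```

Treat the result as a floor, not a guarantee: KV cache grows with `--ctx-size`, so a long context can need far more than 20% headroom.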

Full GPU

Whenever the model fits, use -ngl 999. On a 4060 Ti 16GB:

llama-server -m Llama-3-8B-Q5_K_M.gguf -ngl 999 --ctx-size 8192

On a 96 GB RTX 6000 Pro running Llama 3 70B Q4_K_M, the same pattern applies: the model fits entirely in VRAM, so -ngl 999 offloads every layer.

Partial Offload

When the model exceeds VRAM, pick the highest -ngl that fits. Example: Llama 3 70B Q4_K_M on a 24 GB 3090:

80 layers × 500 MB ≈ 40 GB total, so the model cannot fit. On a 24 GB card, reserve ~4 GB for KV cache and overhead, leaving ~20 GB for layers: 20,000 MB / 500 MB ≈ 40 layers. Start with -ngl 40 and step down if you hit out-of-memory errors.
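The calculation above generalises to any card. A minimal sketch (the helper name `max_ngl` is hypothetical; the clamp mirrors llama.cpp's own behaviour of capping -ngl at the model's layer count):

```shell
#!/usr/bin/env bash
# Highest -ngl that fits: (VRAM - KV cache reserve) / per-layer cost,
# clamped to the model's actual layer count.
# Usage: max_ngl VRAM_MB KV_RESERVE_MB MB_PER_LAYER TOTAL_LAYERS
max_ngl() {
  local vram=$1 kv=$2 per_layer=$3 total=$4
  local n=$(( (vram - kv) / per_layer ))
  (( n > total )) && n=$total   # can't offload more layers than exist
  (( n < 0 ))     && n=0        # KV reserve alone exceeds VRAM
  echo "$n"
}

# Llama 3 70B Q4_K_M on a 24 GB 3090, ~4 GB KV cache reserve:
max_ngl 24000 4000 500 80   # prints 40 -> start with -ngl 40
```

Plug the result into the llama-server command and confirm actual usage with nvidia-smi before committing to the value.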

llama-server -m Llama-3-70B-Q4_K_M.gguf -ngl 40 --ctx-size 4096 -t 14

Expect heavy CPU-GPU traffic and single-digit tokens/sec, since generation speed is gated by the layers left on CPU. For production you want the model fully on GPU; upgrade to a bigger card or a smaller quant. See CPU-GPU offload strategy.


See also: llama.cpp thread tuning and llama.cpp GPU GGUF.


