Among llama.cpp’s many flags, -ngl (or --n-gpu-layers) is the most important for GPU-accelerated GGUF inference on dedicated GPU hosting. It controls how many layers live on the GPU. The right value depends on model, VRAM, and what else is competing for the card.
What It Does
llama.cpp loads a GGUF model and assigns each layer to either CPU RAM or GPU VRAM. -ngl N places the first N layers on the GPU; the rest run on the CPU. Typical layer counts: 32 (7B), 40 (13B), 48 (22B), 60 (32B), 80 (70B).
-ngl 999 is the idiomatic “put everything possible on GPU” value – llama.cpp clamps to the model’s actual layer count.
Layers Per GB
Rough VRAM per layer for common quantisations:
| Model | Quant | ~VRAM per layer |
|---|---|---|
| Llama 3 8B | Q5_K_M | ~150 MB |
| Llama 3 8B | Q4_K_M | ~120 MB |
| Llama 3 70B | Q4_K_M | ~500 MB |
| Llama 3 70B | IQ3_XS | ~380 MB |
Additional VRAM goes to KV cache, CUDA overhead, and the embedding/head layers. Reserve 15-20% headroom beyond your layer estimate.
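The table and the headroom rule combine into a quick back-of-envelope estimate. A minimal sketch, using the approximate per-layer figures above (not measured values):

```python
# Rough VRAM estimate for fully offloading a GGUF model.
# Per-layer costs are the approximations from the table above.
MB_PER_LAYER = {
    ("llama3-8b", "Q5_K_M"): 150,
    ("llama3-8b", "Q4_K_M"): 120,
    ("llama3-70b", "Q4_K_M"): 500,
    ("llama3-70b", "IQ3_XS"): 380,
}

def estimate_vram_gb(model, quant, n_layers, headroom=0.20):
    """Layer weights plus a headroom fraction covering KV cache,
    CUDA context, and the embedding/head tensors."""
    layer_mb = MB_PER_LAYER[(model, quant)] * n_layers
    return layer_mb * (1 + headroom) / 1024

# Llama 3 8B Q5_K_M across 32 layers, with 20% headroom: roughly 5.6 GB
print(round(estimate_vram_gb("llama3-8b", "Q5_K_M", 32), 1))
```

If the estimate lands under your card's VRAM with headroom to spare, full offload (next section) is the right call.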
Full GPU
Whenever the model fits, use -ngl 999. On an RTX 4060 Ti 16GB:
llama-server -m Llama-3-8B-Q5_K_M.gguf -ngl 999 --ctx-size 8192
On an RTX 6000 Pro 96GB, Llama 3 70B Q4_K_M (~40 GB of weights) follows the same pattern:
llama-server -m Llama-3-70B-Q4_K_M.gguf -ngl 999 --ctx-size 8192
Partial Offload
When the model exceeds VRAM, pick the highest -ngl that fits. Example: Llama 3 70B Q4_K_M on a 24 GB 3090:
80 layers × 500 MB = 40 GB total. 24 GB of VRAM minus ~4 GB KV cache = 20 GB usable. 20,000 / 500 ≈ 40 layers. Start with -ngl 40.
llama-server -m Llama-3-70B-Q4_K_M.gguf -ngl 40 --ctx-size 4096 -t 14
Expect heavy CPU-GPU traffic and single-digit tokens/sec. For production you want the model fully on GPU – upgrade to a bigger card. See CPU-GPU offload strategy.
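The arithmetic above generalises to a small helper. A sketch, assuming the same rough numbers as the worked example (a flat ~4 GB KV-cache reservation and a fixed per-layer cost):

```python
def max_gpu_layers(vram_gb, mb_per_layer, n_layers, kv_cache_gb=4.0):
    """Highest -ngl that fits: VRAM left after the KV-cache reservation,
    divided by the per-layer weight cost, capped at the model's layer count."""
    usable_mb = (vram_gb - kv_cache_gb) * 1000  # decimal GB, as in the example
    return min(n_layers, int(usable_mb // mb_per_layer))

# Llama 3 70B Q4_K_M on a 24 GB 3090: (24 - 4) * 1000 / 500 = 40 layers
print(max_gpu_layers(24, 500, 80))  # → 40
```

Treat the result as a starting point: if llama-server hits an out-of-memory error at load, step -ngl down a few layers at a time.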
GGUF Hosting Without Offload Compromise
UK dedicated servers sized so your model lives fully on GPU for native speed.
Browse GPU Servers