
llama.cpp n-gpu-layers Tuning for Mixed Inference

-ngl controls how many transformer layers live on the GPU. Picking the right number balances speed against VRAM - with rules that depend on the model.

Among llama.cpp’s many flags, -ngl (or --n-gpu-layers) is the most important for GPU-accelerated GGUF inference on dedicated GPU hosting. It controls how many layers live on the GPU. The right value depends on model, VRAM, and what else is competing for the card.


What It Does

llama.cpp loads a GGUF model and decides per-layer whether it lives in CPU RAM or GPU VRAM. -ngl N puts the first N layers on the GPU. The remainder run on CPU. Most models have 32 (7B), 40 (13B), 48 (22B), 60 (32B), or 80 (70B) layers.
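Those typical layer counts can be sketched as a quick lookup. This is a hypothetical helper (the name `layers_for_size` and the shell wrapper are not part of llama.cpp); the counts come from the model sizes above:

```shell
#!/usr/bin/env bash
# layers_for_size: rough transformer layer count by model size class.
# Hypothetical helper; values match common Llama-family architectures.
layers_for_size() {
  case "$1" in
    7B|8B)  echo 32 ;;
    13B)    echo 40 ;;
    22B)    echo 48 ;;
    32B)    echo 60 ;;
    70B)    echo 80 ;;
    *)      echo "unknown size: $1" >&2; return 1 ;;
  esac
}

layers_for_size 70B   # prints 80
```

For the exact count of a specific GGUF, check the model's metadata (llama.cpp prints the layer count when loading) rather than relying on size-class rules of thumb.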

-ngl 999 is the idiomatic “put everything possible on GPU” value – llama.cpp clamps to the model’s actual layer count.

Layers Per GB

Rough VRAM per layer for common quantisations:

Model         Quant    ~VRAM per layer
Llama 3 8B    Q5_K_M   ~150 MB
Llama 3 8B    Q4_K_M   ~120 MB
Llama 3 70B   Q4_K_M   ~500 MB
Llama 3 70B   IQ3_XS   ~380 MB

Additional VRAM goes to KV cache, CUDA overhead, and the embedding/head layers. Reserve 15-20% headroom beyond your layer estimate.
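That estimate is easy to script. A minimal sketch, assuming a flat 20% headroom multiplier on top of the per-layer cost (the helper name `estimate_vram_mb` is hypothetical):

```shell
#!/usr/bin/env bash
# Rough VRAM estimate: layers x per-layer cost, plus ~20% headroom
# for KV cache, CUDA overhead, and the embedding/head layers.
# Usage: estimate_vram_mb LAYERS MB_PER_LAYER   (hypothetical helper)
estimate_vram_mb() {
  local layers=$1 mb_per_layer=$2
  # integer maths: multiply by 120/100 to add 20% headroom
  echo $(( layers * mb_per_layer * 120 / 100 ))
}

# Llama 3 8B Q5_K_M: 32 layers x ~150 MB
estimate_vram_mb 32 150   # prints 5760 (MB) -> fits comfortably in 8 GB
```

Treat the result as a floor, not a guarantee: KV cache grows with `--ctx-size`, so a long context can need far more than 20% headroom.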

Full GPU

Whenever the model fits, use -ngl 999. On a 4060 Ti 16GB:

llama-server -m Llama-3-8B-Q5_K_M.gguf -ngl 999 --ctx-size 8192

On a 96 GB RTX 6000 Pro running Llama 3 70B Q4_K_M, the same pattern applies: the model fits entirely in VRAM, so -ngl 999 offloads every layer.

Partial Offload

When the model exceeds VRAM, pick the highest -ngl that fits. Example: Llama 3 70B Q4_K_M on a 24 GB 3090:

80 layers × 500 MB ≈ 40 GB total, so the model cannot fit. On a 24 GB card, reserve ~4 GB for KV cache and overhead, leaving ~20 GB for layers: 20,000 MB / 500 MB ≈ 40 layers. Start with -ngl 40 and step down if you hit out-of-memory errors.
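The calculation above generalises to any card. A minimal sketch (the helper name `max_ngl` is hypothetical; the clamp mirrors llama.cpp's own behaviour of capping -ngl at the model's layer count):

```shell
#!/usr/bin/env bash
# Highest -ngl that fits: (VRAM - KV cache reserve) / per-layer cost,
# clamped to the model's actual layer count.
# Usage: max_ngl VRAM_MB KV_RESERVE_MB MB_PER_LAYER TOTAL_LAYERS
max_ngl() {
  local vram=$1 kv=$2 per_layer=$3 total=$4
  local n=$(( (vram - kv) / per_layer ))
  (( n > total )) && n=$total   # can't offload more layers than exist
  (( n < 0 ))     && n=0        # KV reserve alone exceeds VRAM
  echo "$n"
}

# Llama 3 70B Q4_K_M on a 24 GB 3090, ~4 GB KV cache reserve:
max_ngl 24000 4000 500 80   # prints 40 -> start with -ngl 40
```

Plug the result into the llama-server command and confirm actual usage with nvidia-smi before committing to the value.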

llama-server -m Llama-3-70B-Q4_K_M.gguf -ngl 40 --ctx-size 4096 -t 14

Expect heavy CPU-GPU traffic and single-digit tokens/sec, since generation speed is gated by the layers left on CPU. For production you want the model fully on GPU; upgrade to a bigger card or a smaller quant. See CPU-GPU offload strategy.


See also: llama.cpp thread tuning and llama.cpp GPU GGUF.


