Why llama.cpp on a Dedicated GPU
llama.cpp is the foundational inference engine behind Ollama, LM Studio, and dozens of other LLM tools. Running it directly on a dedicated GPU server gives you maximum control over quantisation, context length, and memory allocation. GGUF models are the most widely available format on HuggingFace, with community-quantised versions of virtually every open model.
The key advantage of llama.cpp is flexibility. It supports partial GPU offloading (split a model between GPU and CPU RAM), every GGUF quantisation level from Q2_K to FP16, and runs on any NVIDIA GPU with CUDA support. For inference-heavy production workloads, vLLM offers better batching, but for single-user and development scenarios llama.cpp is hard to beat.
GGUF Quantisation Tiers Explained
| Quantisation | Bits | VRAM (7B) | Quality | Speed |
|---|---|---|---|---|
| FP16 | 16 | ~14.5 GB | Best | Baseline |
| Q8_0 | 8 | ~7.5 GB | Near-FP16 | Faster |
| Q6_K | 6 | ~6 GB | Excellent | Faster |
| Q5_K_M | 5 | ~5.5 GB | Very Good | Fast |
| Q4_K_M | 4 | ~4.8 GB | Good | Fastest |
| Q3_K_M | 3 | ~3.8 GB | Acceptable | Fastest |
| Q2_K | 2 | ~3 GB | Degraded | Fastest |
For most applications, Q4_K_M offers the best balance of quality and speed. Q5_K_M and Q6_K are worth the extra VRAM when you have it. Q8_0 is effectively lossless for most tasks. See our GPTQ vs AWQ vs GGUF guide for a deeper comparison.
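The VRAM figures in the table follow directly from bits per weight: weights take roughly `params × bits ÷ 8` bytes, plus headroom for the KV cache and compute buffers. A back-of-the-envelope estimator (a sketch: the bits-per-weight values and the overhead factor are assumptions, and real usage varies with context length):

```python
# Rough VRAM estimate for a GGUF model: weight bytes plus an assumed
# overhead factor for KV cache and compute buffers.
BITS_PER_WEIGHT = {  # approximate effective bits per weight (assumptions)
    "FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6,
    "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 3.35,
}

def estimate_vram_gb(params_billions: float, quant: str,
                     overhead: float = 1.15) -> float:
    """Weights in GB = params * bits / 8, scaled by an assumed overhead."""
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return round(weights_gb * overhead, 1)

for quant in BITS_PER_WEIGHT:
    print(quant, estimate_vram_gb(7, quant))
```

The estimates land close to the table above for a 7B model; treat them as a planning aid, not a guarantee, since K-quants mix precisions across tensors.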
Installation and GPU Build
```bash
# Clone and build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Download a GGUF model
wget https://huggingface.co/TheBloke/Llama-3-8B-Instruct-GGUF/resolve/main/llama-3-8b-instruct.Q4_K_M.gguf

# Test inference
./build/bin/llama-cli \
  -m llama-3-8b-instruct.Q4_K_M.gguf \
  -ngl 99 \
  -p "Explain quantum computing in simple terms"
```
The `-ngl 99` flag offloads all model layers to the GPU. Lower the number to split layers between GPU VRAM and CPU RAM when the model exceeds your VRAM; layers that stay on the CPU run much slower, so offload as many as fit. For CUDA setup details, see the CUDA installation guide.
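Choosing an `-ngl` value for a model that doesn't fully fit is simple arithmetic: divide the VRAM you can spare by the approximate size of one layer. A hedged sketch (the example file size, layer count, and reserve are assumptions; check your model's layer count in the llama.cpp startup log):

```python
# Estimate how many transformer layers fit on the GPU for partial offload.
# Treats the GGUF file as evenly split across layers, which is only
# approximately true, and reserves VRAM for KV cache and compute buffers.
def layers_that_fit(model_size_gb: float, n_layers: int,
                    free_vram_gb: float, reserve_gb: float = 1.5) -> int:
    per_layer_gb = model_size_gb / n_layers
    usable = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# Example (assumed numbers): a ~19 GB Q4_K_M file with 48 layers
# on a 16 GB card suggests -ngl around this value:
print(layers_that_fit(19.0, 48, 16.0))
```

Start with the estimate, then adjust downward if you see out-of-memory errors at your target context length, since the KV cache grows with `-c`.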
Running llama.cpp in Server Mode
```bash
# Start the OpenAI-compatible API server
./build/bin/llama-server \
  -m llama-3-8b-instruct.Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080 \
  --n-predict 512

# Test with curl
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
The server exposes an OpenAI-compatible API, making it a drop-in replacement for any application using the OpenAI SDK. Secure the endpoint with an nginx reverse proxy as described in the secure AI inference API guide.
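The same request can be built programmatically. A minimal sketch that constructs the chat-completions payload the curl example sends (llama-server typically ignores the `model` field, since it serves whichever model it was started with; the function name here is illustrative):

```python
import json

def chat_request(prompt: str, model: str = "llama3",
                 max_tokens: int = 512) -> str:
    """Build the JSON body for a POST to /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

payload = chat_request("Hello!")
print(payload)
# Send with Content-Type: application/json to
# http://localhost:8080/v1/chat/completions — or point the OpenAI SDK at
# base_url="http://localhost:8080/v1" and use it unchanged.
```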
Performance Benchmarks by GPU
| Model (Q4_K_M) | RTX 4060 (t/s) | RTX 3090 (t/s) | RTX 5080 (t/s) | RTX 5090 (t/s) |
|---|---|---|---|---|
| Llama 3 8B | ~38 | ~75 | ~120 | ~145 |
| Mistral 7B | ~40 | ~78 | ~125 | ~150 |
| DeepSeek R1 7B | ~36 | ~72 | ~115 | ~140 |
| CodeLlama 34B | OOM | ~18 | OOM | ~32 |
| Mixtral 8x7B | OOM | OOM | OOM | ~30 |
llama.cpp performance scales nearly linearly with memory bandwidth. The Blackwell GPUs (5080, 5090) benefit from GDDR7’s higher throughput. Compare these numbers across more models at the tokens-per-second benchmark.
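The bandwidth scaling can be sanity-checked with a simple model: single-stream token generation reads roughly the entire weight file from VRAM per token, so memory bandwidth divided by model size gives a theoretical throughput ceiling. A sketch (the bandwidth figures are published specs; measured numbers land well below the ceiling):

```python
# Theoretical tokens/s ceiling for memory-bandwidth-bound decoding:
# each token reads ~all weights once, so ceiling = bandwidth / model size.
GPU_BANDWIDTH_GBPS = {  # published memory bandwidth specs
    "RTX 4060": 272, "RTX 3090": 936, "RTX 5090": 1792,
}

def tps_ceiling(model_size_gb: float, gpu: str) -> float:
    return GPU_BANDWIDTH_GBPS[gpu] / model_size_gb

# A ~4.8 GB Q4_K_M 8B model on an RTX 3090 (upper bound, not a prediction):
print(round(tps_ceiling(4.8, "RTX 3090")))
```

Real throughput sits well under the ceiling because of kernel launch overhead, KV-cache reads, and attention compute, but the ceiling explains why the GDDR7 cards pull ahead at the same quantisation level.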
llama.cpp vs vLLM vs Ollama
Choose llama.cpp directly when you need fine-grained control over quantisation, partial GPU offloading for models larger than your VRAM, or specific GGUF model variants. Choose vLLM for production batch serving with continuous batching. Choose Ollama for the simplest setup experience (Ollama uses llama.cpp internally).
For self-hosting guidance beyond the inference engine, see the self-hosting LLM guide and calculate your costs with the LLM cost calculator. Explore more deployment options in the tutorials section.
GPU Servers for llama.cpp Inference
Full root access, CUDA pre-installed, any GGUF model. Dedicated GPU hardware in the UK.
Browse GPU Servers