Why llama.cpp on a Dedicated GPU
llama.cpp is the foundational inference engine behind Ollama, LM Studio, and dozens of other LLM tools. Running it directly on a dedicated GPU server gives you maximum control over quantisation, context length, and memory allocation. GGUF models are the most widely available format on HuggingFace, with community-quantised versions of virtually every open model.
The key advantage of llama.cpp is flexibility. It supports partial GPU offloading (split a model between GPU and CPU RAM), every GGUF quantisation level from Q2_K to FP16, and runs on any NVIDIA GPU with CUDA support. For inference-heavy production workloads, vLLM offers better batching, but for single-user and development scenarios llama.cpp is hard to beat.
GGUF Quantisation Tiers Explained
| Quantisation | Bits | VRAM (7B) | Quality | Speed |
|---|---|---|---|---|
| FP16 | 16 | ~14.5 GB | Best | Baseline |
| Q8_0 | 8 | ~7.5 GB | Near-FP16 | Faster |
| Q6_K | 6 | ~6 GB | Excellent | Faster |
| Q5_K_M | 5 | ~5.5 GB | Very Good | Fast |
| Q4_K_M | 4 | ~4.8 GB | Good | Fastest |
| Q3_K_M | 3 | ~3.8 GB | Acceptable | Fastest |
| Q2_K | 2 | ~3 GB | Degraded | Fastest |
For most applications, Q4_K_M offers the best balance of quality and speed. Q5_K_M and Q6_K are worth the extra VRAM when you have it. Q8_0 is effectively lossless for most tasks. See our GPTQ vs AWQ vs GGUF guide for a deeper comparison.
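The VRAM figures in the table follow directly from bits per weight: weights take roughly `params × bits ÷ 8` bytes, plus headroom for the KV cache and compute buffers. A back-of-the-envelope estimator (a sketch: the bits-per-weight values and the overhead factor are assumptions, and real usage varies with context length):

```python
# Rough VRAM estimate for a GGUF model: weight bytes plus an assumed
# overhead factor for KV cache and compute buffers.
BITS_PER_WEIGHT = {  # approximate effective bits per weight (assumptions)
    "FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6,
    "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 3.35,
}

def estimate_vram_gb(params_billions: float, quant: str,
                     overhead: float = 1.15) -> float:
    """Weights in GB = params * bits / 8, scaled by an assumed overhead."""
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return round(weights_gb * overhead, 1)

for quant in BITS_PER_WEIGHT:
    print(quant, estimate_vram_gb(7, quant))
```

The estimates land close to the table above for a 7B model; treat them as a planning aid, not a guarantee, since K-quants mix precisions across tensors.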
Installation and GPU Build
```bash
# Clone and build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Download a GGUF model
wget https://huggingface.co/TheBloke/Llama-3-8B-Instruct-GGUF/resolve/main/llama-3-8b-instruct.Q4_K_M.gguf

# Test inference
./build/bin/llama-cli \
  -m llama-3-8b-instruct.Q4_K_M.gguf \
  -ngl 99 \
  -p "Explain quantum computing in simple terms"
```
The `-ngl 99` flag offloads all model layers to the GPU. Lower the number to split layers between GPU VRAM and CPU RAM when the model exceeds your VRAM; layers that stay on the CPU run much slower, so offload as many as fit. For CUDA setup details, see the CUDA installation guide.
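Choosing an `-ngl` value for a model that doesn't fully fit is simple arithmetic: divide the VRAM you can spare by the approximate size of one layer. A hedged sketch (the example file size, layer count, and reserve are assumptions; check your model's layer count in the llama.cpp startup log):

```python
# Estimate how many transformer layers fit on the GPU for partial offload.
# Treats the GGUF file as evenly split across layers, which is only
# approximately true, and reserves VRAM for KV cache and compute buffers.
def layers_that_fit(model_size_gb: float, n_layers: int,
                    free_vram_gb: float, reserve_gb: float = 1.5) -> int:
    per_layer_gb = model_size_gb / n_layers
    usable = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# Example (assumed numbers): a ~19 GB Q4_K_M file with 48 layers
# on a 16 GB card suggests -ngl around this value:
print(layers_that_fit(19.0, 48, 16.0))
```

Start with the estimate, then adjust downward if you see out-of-memory errors at your target context length, since the KV cache grows with `-c`.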
Running llama.cpp in Server Mode
```bash
# Start the OpenAI-compatible API server
./build/bin/llama-server \
  -m llama-3-8b-instruct.Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080 \
  --n-predict 512

# Test with curl
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
The server exposes an OpenAI-compatible API, making it a drop-in replacement for any application using the OpenAI SDK. Secure the endpoint with an nginx reverse proxy as described in the secure AI inference API guide.
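The same request can be built programmatically. A minimal sketch that constructs the chat-completions payload the curl example sends (llama-server typically ignores the `model` field, since it serves whichever model it was started with; the function name here is illustrative):

```python
import json

def chat_request(prompt: str, model: str = "llama3",
                 max_tokens: int = 512) -> str:
    """Build the JSON body for a POST to /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

payload = chat_request("Hello!")
print(payload)
# Send with Content-Type: application/json to
# http://localhost:8080/v1/chat/completions — or point the OpenAI SDK at
# base_url="http://localhost:8080/v1" and use it unchanged.
```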
Performance Benchmarks by GPU
| Model (Q4_K_M) | RTX 4060 (t/s) | RTX 3090 (t/s) | RTX 5080 (t/s) | RTX 5090 (t/s) |
|---|---|---|---|---|
| Llama 3 8B | ~38 | ~75 | ~120 | ~145 |
| Mistral 7B | ~40 | ~78 | ~125 | ~150 |
| DeepSeek R1 7B | ~36 | ~72 | ~115 | ~140 |
| CodeLlama 34B | OOM | ~18 | OOM | ~32 |
| Mixtral 8x7B | OOM | OOM | OOM | ~30 |
llama.cpp performance scales nearly linearly with memory bandwidth. The Blackwell GPUs (5080, 5090) benefit from GDDR7’s higher throughput. Compare these numbers across more models at the tokens-per-second benchmark.
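The bandwidth scaling can be sanity-checked with a simple model: single-stream token generation reads roughly the entire weight file from VRAM per token, so memory bandwidth divided by model size gives a theoretical throughput ceiling. A sketch (the bandwidth figures are published specs; measured numbers land well below the ceiling):

```python
# Theoretical tokens/s ceiling for memory-bandwidth-bound decoding:
# each token reads ~all weights once, so ceiling = bandwidth / model size.
GPU_BANDWIDTH_GBPS = {  # published memory bandwidth specs
    "RTX 4060": 272, "RTX 3090": 936, "RTX 5090": 1792,
}

def tps_ceiling(model_size_gb: float, gpu: str) -> float:
    return GPU_BANDWIDTH_GBPS[gpu] / model_size_gb

# A ~4.8 GB Q4_K_M 8B model on an RTX 3090 (upper bound, not a prediction):
print(round(tps_ceiling(4.8, "RTX 3090")))
```

Real throughput sits well under the ceiling because of kernel launch overhead, KV-cache reads, and attention compute, but the ceiling explains why the GDDR7 cards pull ahead at the same quantisation level.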
llama.cpp vs vLLM vs Ollama
Choose llama.cpp directly when you need fine-grained control over quantisation, partial GPU offloading for models larger than your VRAM, or specific GGUF model variants. Choose vLLM for production batch serving with continuous batching. Choose Ollama for the simplest setup experience (Ollama uses llama.cpp internally).
For self-hosting guidance beyond the inference engine, see the self-hosting LLM guide and calculate your costs with the LLM cost calculator. Explore more deployment options in the tutorials section.
GPU Servers for llama.cpp Inference
Full root access, CUDA pre-installed, any GGUF model. Dedicated GPU hardware in the UK.
Browse GPU Servers