GGUF is llama.cpp’s file format for quantised models. On the RTX 5060 Ti 16GB at our hosting, GGUF models serve via llama-server with full GPU offload, making it a lightweight alternative to vLLM for specific use cases.
GGUF Variants
Common GGUF quantisations on a 16 GB card:
| Quant | Bits | Quality | Use Case |
|---|---|---|---|
| Q8_0 | 8 | Near-FP16 | Small models where quality matters |
| Q6_K | ~6.5 | Very close to FP16 | Balanced quality/size |
| Q5_K_M | ~5.5 | Strong | Balanced production default |
| Q4_K_M | ~4.5 | Good | Standard quantisation |
| IQ3_XS | ~3 | Noticeable quality loss | Larger models that would not fit |
| IQ2_XS | ~2 | Significant quality loss | Extreme fits only |
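To see how these quants map to the 16 GB budget, a rough size estimate is params × bits-per-weight ÷ 8. The sketch below uses approximate effective bits-per-weight figures (K-quants mix bit widths across tensor types, so real files vary a little); the numbers are illustrative, not exact.

```python
# Approximate effective bits per weight for common GGUF quants.
# These are ballpark figures, not exact: real files differ slightly
# because K-quants use different bit widths per tensor type.
QUANT_BPW = {
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.69,
    "Q4_K_M": 4.85,
    "IQ3_XS": 3.3,
    "IQ2_XS": 2.31,
}

def gguf_size_gib(params_billion: float, quant: str) -> float:
    """Rough on-disk / VRAM weight size in GiB for a given quant."""
    total_bits = params_billion * 1e9 * QUANT_BPW[quant]
    return total_bits / 8 / 2**30

for quant in QUANT_BPW:
    print(f"{quant:8s} ~{gguf_size_gib(8.0, quant):.1f} GiB for an 8B model")
```

For an 8B model this puts Q5_K_M around 5.3 GiB of weights, leaving comfortable headroom on 16 GB for KV cache and activations.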
Serving
llama-server \
-m Llama-3.1-8B-Instruct-Q5_K_M.gguf \
-ngl 999 \
--ctx-size 8192 \
--parallel 8 \
--host 0.0.0.0 --port 8080 \
-fa
-ngl 999 puts all layers on the GPU, --parallel 8 allows 8 concurrent sequences, and -fa enables Flash Attention.
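Once running, llama-server exposes an OpenAI-compatible endpoint. A minimal Python client sketch, assuming the host/port flags above (the URL and generation parameters are illustrative):

```python
import json
from urllib import request

# Matches the --host/--port flags in the llama-server command above.
URL = "http://localhost:8080/v1/chat/completions"

def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """OpenAI-style chat payload accepted by llama-server."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def chat(prompt: str) -> str:
    """Send one chat request and return the assistant's reply text."""
    body = json.dumps(build_payload(prompt)).encode()
    req = request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

No extra dependencies are needed; any OpenAI-compatible SDK pointed at the same URL works equally well.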
Tuning
- -t 4 – CPU threads (low value when fully on GPU)
- -tb 8 – CPU threads for prompt processing
- --cont-batching – continuous batching like vLLM
- --flash-attn or -fa – Flash Attention kernels
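Note that --ctx-size is shared across slots: with --parallel, llama-server divides the total context among the concurrent sequences. A quick sketch of the arithmetic, using illustrative Llama-3.1-8B defaults (32 layers, 8 KV heads, head dim 128, FP16 cache) for the KV-cache estimate:

```python
def per_slot_ctx(ctx_size: int, parallel: int) -> int:
    """llama-server splits the total context evenly among parallel slots."""
    return ctx_size // parallel

def kv_cache_gib(ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Rough FP16 KV-cache size: 2 (K and V) x layers x kv_heads x head_dim
    x bytes x tokens. Defaults assume Llama-3.1-8B-style GQA dimensions."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx / 2**30

print(per_slot_ctx(8192, 8))            # 1024 tokens per concurrent sequence
print(f"{kv_cache_gib(8192):.2f} GiB")  # ~1 GiB total KV cache at FP16
```

So the serving command above gives each of the 8 sequences only 1024 tokens of context; raise --ctx-size (and budget KV-cache VRAM accordingly) if clients need longer prompts.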
vs vLLM / AWQ
- GGUF: llama.cpp runtime, single binary, simpler deploy, broader hardware support
- AWQ in vLLM: PagedAttention for better concurrency, OpenAI-compatible API native, better for production serving
- FP8 in vLLM: Blackwell-native, best speed
When to Use GGUF
- You already have GGUF checkpoints from local runs
- Single-process lightweight deployment
- You want llama.cpp’s CPU+GPU hybrid offload for oversized models
- Cross-hardware compatibility matters (same GGUF runs on Mac/AMD/Intel)
For vLLM-style concurrent serving, prefer AWQ or FP8. For simpler single-process deployments, GGUF works.
GGUF Ready on Blackwell
llama.cpp with full GPU offload. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: llama.cpp setup, n-gpu-layers tuning.