llama.cpp remains the lightweight option when you want GGUF, portable builds, and minimal dependencies. Build and serve on the RTX 5060 Ti 16GB at our hosting:
Build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120"
cmake --build build --config Release -j
CMAKE_CUDA_ARCHITECTURES="120" targets Blackwell (sm_120) – use the architecture string that matches your GPU and CUDA toolkit (sm_120 requires CUDA 12.8 or newer). If CMake errors on the flag, omit it and let the build autodetect the architecture.
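If you are unsure which architecture string to pass, the GPU's compute capability can be read from nvidia-smi and mapped to the CMake value by dropping the dot. A small bash sketch (assumes a driver recent enough to support the compute_cap query field):

```shell
# Query the GPU's compute capability (e.g. "12.0" on the RTX 5060 Ti)
# and strip the dot to get the CMake architecture string ("120").
cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)
arch="${cap/./}"
echo "Compute capability $cap -> CMAKE_CUDA_ARCHITECTURES=$arch"
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$arch"
```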
Download GGUF Model
mkdir -p models
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--local-dir models
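A quick sanity check before serving: every GGUF file begins with the 4-byte ASCII magic GGUF, so a truncated download or a saved HTML error page is easy to spot:

```shell
# A valid GGUF file starts with the ASCII magic "GGUF";
# anything else means the download is corrupt or incomplete.
magic=$(head -c 4 models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf)
if [ "$magic" = "GGUF" ]; then
  echo "looks like a valid GGUF file"
else
  echo "bad download: got '$magic'"
fi
```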
Run llama-server
./build/bin/llama-server \
-m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
-ngl 99 \
-c 32768 \
--flash-attn \
--cache-type-k q8_0 --cache-type-v q8_0 \
--host 0.0.0.0 --port 8080
OpenAI-compatible endpoint at http://host:8080/v1.
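Any OpenAI-style client can talk to that endpoint. A minimal curl smoke test (the model field is largely cosmetic here, since llama-server serves the single loaded model; the name below is just a placeholder):

```shell
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.1-8b",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'
```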
Blackwell Tuning
| Flag | Purpose |
|---|---|
| -ngl 99 | Offload all layers to the GPU |
| --flash-attn | Enable FlashAttention – big speedup |
| --cache-type-k/v q8_0 | 8-bit KV cache – roughly doubles usable context |
| -c 32768 | Context size (tokens) |
| --parallel 4 | 4 concurrent request slots |
| --n-predict 512 | Default max output tokens |
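The KV-cache row is easy to sanity-check with back-of-envelope arithmetic. A sketch for Llama 3 8B, under assumed model dimensions (32 layers, 8 KV heads, head dim 128) and treating q8_0 as ~1 byte per element, ignoring its small per-block scale overhead:

```shell
# Approximate KV-cache size = 2 (K and V) * layers * context * kv_heads * head_dim * bytes/elem
n_layers=32; n_kv_heads=8; head_dim=128; n_ctx=32768
f16_bytes=$(( 2 * n_layers * n_ctx * n_kv_heads * head_dim * 2 ))
q8_bytes=$(( f16_bytes / 2 ))   # q8_0 is roughly half of f16
echo "f16 KV cache:  $(( f16_bytes / 1048576 )) MiB"   # 4096 MiB at 32k context
echo "q8_0 KV cache: $(( q8_bytes / 1048576 )) MiB"    # 2048 MiB at 32k context
```

Halving the per-token cache is why the same VRAM budget fits roughly twice the context.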
Expected throughput: ~95 tokens/s at batch size 1 on Llama 3 8B Q4_K_M. See the GGUF hosting guide for more model variants.
llama.cpp on Blackwell 16GB
Lightweight GGUF server, full CUDA acceleration. UK dedicated hosting. Order the RTX 5060 Ti 16GB.

See also: GGUF hosting, Ollama setup, vLLM setup, TGI setup.