Ollama is the fastest way to spin up an LLM on a fresh box – one command and you have an OpenAI-compatible API. Here’s the playbook for the RTX 5060 Ti 16GB on our hosting.
Install
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
Driver 570+ and CUDA 12.8+ assumed – Blackwell cards like the RTX 5060 Ti need them; see the driver install guide.
Pull a Model
# Llama 3.1 8B Q4_K_M (default)
ollama pull llama3.1:8b
# Qwen 2.5 14B Q4
ollama pull qwen2.5:14b
# Phi-3 mini
ollama pull phi3:mini
# List and test
ollama list
ollama run llama3.1:8b "Hello, who are you?"
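As a rough sanity check on what fits in 16 GB alongside the KV cache, you can estimate the Q4 file sizes from parameter counts. A sketch, assuming ~4.85 bits per weight on average for Q4_K_M (K-quants mix 4- and 6-bit blocks) and approximate parameter counts of 8.0B, 14.7B and 3.8B:

```shell
# Back-of-envelope GGUF sizes at Q4_K_M.
# ~4.85 bits/weight is an assumed average; parameter counts are approximate.
awk 'BEGIN {
  split("llama3.1:8b qwen2.5:14b phi3:mini", name)
  split("8.0 14.7 3.8", bparams)
  for (i = 1; i <= 3; i++)
    printf "%-14s ~%.2f GB\n", name[i], bparams[i] * 4.85 / 8
}'
```

The estimates land close to what `ollama list` reports for the downloaded files (roughly 4.9, 9.0 and 2.2 GB respectively).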
Use the API
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Say hello"}]
}'
The endpoint is OpenAI-compatible – point any OpenAI SDK at http://your-server:11434/v1. Note that Ollama listens on 127.0.0.1 by default; set OLLAMA_HOST=0.0.0.0 in the service environment (and firewall the port) to serve remote clients.
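Since the request body is plain JSON, a small wrapper makes the endpoint easy to reuse from scripts. A sketch with example defaults – the quoting is naive, so it only suits prompts without embedded quotes:

```shell
#!/bin/sh
# Parameterise the chat request: $1 = model, $2 = prompt.
# Assumes a local Ollama on the default port 11434.
MODEL="${1:-llama3.1:8b}"
PROMPT="${2:-Say hello}"
BODY=$(printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' \
  "$MODEL" "$PROMPT")
echo "$BODY"
# Uncomment to actually send the request:
# curl -s http://localhost:11434/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$BODY"
```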
Config for 16 GB
Ollama’s default settings work, but on 16 GB you can optimise via a systemd drop-in at /etc/systemd/system/ollama.service.d/override.conf (run sudo systemctl daemon-reload && sudo systemctl restart ollama after editing):
[Service]
# Keep models loaded in VRAM for 30 minutes after the last request
Environment="OLLAMA_KEEP_ALIVE=30m"
# Serve up to 4 requests per model concurrently
Environment="OLLAMA_NUM_PARALLEL=4"
# Allow two models resident at once (mind the 16 GB budget)
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Flash attention; required for KV-cache quantisation below
Environment="OLLAMA_FLASH_ATTENTION=1"
# Quantise the KV cache to 8-bit
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
OLLAMA_KV_CACHE_TYPE=q8_0 quantises the KV cache to 8-bit, roughly halving its footprint versus the default f16 – more context fits in 16 GB. It only takes effect with flash attention enabled.
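To see what q8_0 buys you, work out the per-token KV-cache cost for llama3.1:8b, assuming its usual GQA shape (32 layers, 8 KV heads, head dim 128 – assumptions, check your model card):

```shell
# KV cache bytes/token = 2 (K and V) x layers x kv_heads x head_dim x bytes/elem
LAYERS=32; KV_HEADS=8; HEAD_DIM=128
F16=$((2 * LAYERS * KV_HEADS * HEAD_DIM * 2))   # f16: 2 bytes per element
Q8=$((2 * LAYERS * KV_HEADS * HEAD_DIM * 1))    # q8_0: ~1 byte per element (plus small scale overhead)
echo "f16:  $((F16 / 1024)) KiB/token -> $((F16 * 32768 / 1024 / 1024 / 1024)) GiB at 32K context"
echo "q8_0: $((Q8 / 1024)) KiB/token -> $((Q8 * 32768 / 1024 / 1024 / 1024)) GiB at 32K context"
```

Halving the cache frees roughly 2 GiB at a 32K context – worth having when the model file alone takes ~5 GB of the 16 GB card.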
Ollama vs vLLM
| Aspect | Ollama | vLLM |
|---|---|---|
| Setup time | 2 minutes | 15 minutes |
| Single-user throughput | Good (GGUF Q4) | Better (FP8) |
| Concurrent user throughput | Moderate | Excellent |
| Model format | GGUF only | HF, AWQ, GPTQ, FP8, GGUF |
| Best for | Solo dev, quick start | Production, concurrency |
Start with Ollama for easy setup, move to vLLM when you need concurrency or precise quantisation control.
See also: vLLM setup, llama.cpp setup, GGUF hosting, OpenWebUI setup.