Symptom: Ollama Generating Tokens Slowly on GPU
Your GPU server has plenty of VRAM, Ollama detects the GPU, yet token generation crawls at 5-10 tokens per second when you expected 50+. Running nvidia-smi shows the GPU is active, but utilization hovers around 20-30%. The Ollama logs reveal:
msg="model loaded" gpu_layers=20/35 vram="4.2 GiB"
That partial layer offload is the giveaway. Ollama is splitting the model between GPU and CPU, and every token generation round-trips across the PCIe bus. Fixing this requires ensuring full GPU offloading and tuning the inference parameters.
Check GPU Layer Offloading
Ollama automatically decides how many model layers to place on the GPU based on available VRAM. If other processes consume VRAM, fewer layers fit:
# Check current VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
# Check Ollama's layer allocation
curl -s http://localhost:11434/api/ps | python3 -m json.tool
If the log shows gpu_layers below the model's total layer count, or size_vram in the /api/ps output is smaller than size, part of the model runs on CPU. Kill other GPU processes or switch to a smaller model to reclaim VRAM.
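As a quick sanity check, the /api/ps response can be parsed programmatically. A minimal sketch, assuming the documented response shape (a models array whose entries carry size and size_vram byte counts):

```python
def vram_fraction(model):
    """Fraction of a loaded model resident in VRAM, from one /api/ps entry.

    Assumes the entry carries 'size' (total bytes) and 'size_vram'
    (bytes offloaded to the GPU), as Ollama's /api/ps returns them.
    """
    return model["size_vram"] / model["size"] if model["size"] else 0.0

def report(ps_response):
    """Flag any model that is only partially offloaded."""
    for m in ps_response.get("models", []):
        frac = vram_fraction(m)
        status = "fully on GPU" if frac >= 1.0 else "PARTIAL OFFLOAD"
        print(f'{m["name"]}: {frac:.0%} in VRAM ({status})')
```

Feed it the parsed JSON from `curl -s http://localhost:11434/api/ps`; anything below 100% means some layers are running on the CPU.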
Use Properly Quantized Models
Running an FP16 model that spills out of VRAM when a Q4_K_M quantization of the same model would fit entirely is the most common performance mistake. Note that Ollama's default tags are already quantized; the full-precision variants carry an explicit fp16 suffix:
# Instead of the full-precision variant
ollama pull llama3.1:70b-instruct-fp16
# Use a quantized variant that fits in VRAM
ollama pull llama3.1:70b-instruct-q4_K_M
A model fully loaded into VRAM at Q4 quantization will outperform a half-offloaded FP16 model every time. Check the benchmarks section for quantization performance comparisons across different GPU configurations.
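A back-of-the-envelope estimate shows why: weight memory scales with bits per weight. A rough sketch (the ~4.5 bits/weight figure for Q4_K_M and the 20% runtime overhead are approximations, not measured values):

```python
def approx_weights_gib(params_billions, bits_per_weight, overhead=1.2):
    """Rough VRAM footprint: parameter count times bit width, plus an
    assumed ~20% overhead for activations and runtime buffers."""
    bytes_needed = params_billions * 1e9 * bits_per_weight / 8 * overhead
    return bytes_needed / 2**30

# 70B at FP16 vs roughly Q4_K_M (~4.5 bits/weight)
print(f"FP16:   {approx_weights_gib(70, 16):.0f} GiB")
print(f"Q4_K_M: {approx_weights_gib(70, 4.5):.0f} GiB")
```

By the same arithmetic an 8B model at Q4 lands around 5 GiB, which is why it fits comfortably on consumer cards while a 70B model needs a high-VRAM GPU even when quantized.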
Tune Context Length
Large context windows consume significant VRAM. Ollama defaults vary by model, but you can override them:
# Set a shorter context window to free VRAM for layers
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain GPU hosting",
  "options": {
    "num_ctx": 4096
  }
}'
Reducing context from 32768 to 4096 can free several gigabytes of VRAM, allowing all model layers onto the GPU. Only increase context length when your workload genuinely requires it.
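The VRAM the context window consumes is dominated by the KV cache, which grows linearly with num_ctx. A sketch of the standard formula, plugging in Llama 3.1 8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) with FP16 cache entries:

```python
def kv_cache_gib(layers, num_ctx, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, one vector of
    kv_heads * head_dim elements per cached token."""
    return 2 * layers * num_ctx * kv_heads * head_dim * bytes_per_elem / 2**30

# Llama 3.1 8B: 32 layers, 8 KV heads, head_dim 128
for ctx in (32768, 4096):
    print(f"num_ctx={ctx:>5}: {kv_cache_gib(32, ctx, 8, 128):.2f} GiB")
```

For this model the cache shrinks from 4.0 GiB at 32768 tokens to 0.5 GiB at 4096, freeing 3.5 GiB for model layers.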
Configure Parallel Request Handling
Ollama supports concurrent model instances, but each consumes additional VRAM. If multiple requests compete for GPU memory, performance drops for all of them:
# Limit to single concurrent request for maximum speed
OLLAMA_NUM_PARALLEL=1 ollama serve
# Or allow 2 parallel requests if VRAM permits
OLLAMA_NUM_PARALLEL=2 ollama serve
For production Ollama hosting, set this based on your GPU’s VRAM capacity and the model size. A single RTX 6000 Pro 96 GB can handle 2-3 parallel Llama 3.1 8B requests comfortably.
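Capacity planning here is simple arithmetic: the weights are loaded once and shared, but each parallel slot needs its own KV cache allocation. A rough sketch (the 2 GiB reserve for the CUDA context and runtime buffers is an assumption):

```python
def max_parallel(vram_gib, model_gib, kv_per_slot_gib, reserve_gib=2.0):
    """Parallel slots that fit: weights are shared across requests, but
    each slot gets its own context (KV cache) allocation."""
    free = vram_gib - reserve_gib - model_gib
    return max(0, int(free // kv_per_slot_gib))

# Hypothetical 24 GiB card, ~5 GiB Q4 8B model, ~4 GiB KV cache per slot
print(max_parallel(24, 5, 4))
```

Since Ollama sizes the context allocation up front when OLLAMA_NUM_PARALLEL is raised, an over-optimistic setting can push model layers back onto the CPU and recreate the original slowdown.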
Enable Flash Attention
Newer Ollama versions support flash attention, which reduces memory usage and speeds up generation:
# Enable flash attention via environment variable
OLLAMA_FLASH_ATTENTION=1 ollama serve
Flash attention is particularly effective at longer context lengths and can improve throughput by 20-40% on supported architectures. Ensure your CUDA installation is up to date for compatibility.
Benchmark and Verify
# Run a non-streamed generation and compute the eval rate
# (eval_count = tokens generated, eval_duration = nanoseconds)
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Write a paragraph about cloud computing",
  "stream": false
}' | python3 -c 'import sys, json; r = json.load(sys.stdin); print(round(r["eval_count"] / (r["eval_duration"] / 1e9), 1), "tok/s")'
# Monitor GPU during generation
watch -n 0.5 nvidia-smi
Target eval rates: 8B models should hit 40-80 tok/s on an RTX 5090, 15-30 tok/s on an RTX 3090. If your numbers are far below these baselines, revisit layer offloading and VRAM allocation. For higher-throughput production deployments, vLLM offers continuous batching and PagedAttention. See the vLLM production setup guide and browse the tutorials for more optimization techniques. For PyTorch-based workloads, our PyTorch GPU installation guide covers the foundational setup.
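Single runs are noisy; when comparing settings (quantization, num_ctx, flash attention), averaging a few generations gives steadier numbers. A minimal stdlib-only sketch, assuming the /api/generate response fields eval_count and eval_duration (nanoseconds) described above:

```python
import json
from urllib.request import Request, urlopen

def tokens_per_second(eval_count, eval_duration_ns):
    """Ollama reports eval_duration in nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model, prompt, runs=3, host="http://localhost:11434"):
    """Average eval rate over several non-streamed generations."""
    rates = []
    for _ in range(runs):
        body = json.dumps({"model": model, "prompt": prompt,
                           "stream": False}).encode()
        req = Request(f"{host}/api/generate", data=body,
                      headers={"Content-Type": "application/json"})
        with urlopen(req) as resp:
            r = json.load(resp)
        rates.append(tokens_per_second(r["eval_count"], r["eval_duration"]))
    return sum(rates) / len(rates)
```

Run it once per configuration change, against the same prompt, so the comparison isolates the setting you tuned.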
Fast GPU Servers for Ollama
GigaGPU dedicated servers with RTX 5090 and RTX 6000 Pro GPUs — full VRAM for maximum inference speed.
Browse GPU Servers