
Ollama Slow on GPU: Speed Optimization

Fix slow Ollama inference on GPU servers. Covers VRAM allocation, model quantization, context length tuning, GPU offloading layers, and batch configuration for faster token generation.

Symptom: Ollama Generating Tokens Slowly on GPU

Your GPU server has plenty of VRAM, Ollama detects the GPU, yet token generation crawls at 5-10 tokens per second when you expected 50+. Running nvidia-smi shows the GPU is active but utilisation hovers around 20-30%. The Ollama logs reveal:

msg="model loaded" gpu_layers=20/35 vram="4.2 GiB"

That partial layer offload is the giveaway. Ollama is splitting the model between GPU and CPU, and every token generation round-trips across the PCIe bus. Fixing this requires ensuring full GPU offloading and tuning the inference parameters.

Check GPU Layer Offloading

Ollama automatically decides how many model layers to place on the GPU based on available VRAM. If other processes consume VRAM, fewer layers fit:

# Check current VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# Check Ollama's layer allocation
curl -s http://localhost:11434/api/ps | python3 -m json.tool

If the logs show gpu_layers below the model's total layer count, or /api/ps reports size_vram smaller than size, part of the model runs on CPU. Kill other GPU processes or switch to a smaller model to reclaim VRAM.
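A quick way to spot partial offload is to compare the size and size_vram fields that /api/ps returns for each loaded model. A minimal sketch, run here against a sample response rather than a live server (pipe the real curl output in instead):

```shell
# Sample /api/ps-style payload; in practice use:
#   response=$(curl -s http://localhost:11434/api/ps)
response='{"models":[{"name":"llama3.1:8b","size":6000000000,"size_vram":3600000000}]}'

# Report what fraction of each loaded model actually resides in VRAM
echo "$response" | python3 -c '
import json, sys
for m in json.load(sys.stdin)["models"]:
    pct = 100.0 * m["size_vram"] / m["size"]
    print("%s: %.0f%% of model in VRAM" % (m["name"], pct))
'
# -> llama3.1:8b: 60% of model in VRAM
```

Anything below 100% means some layers are running on CPU.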

Use Properly Quantized Models

Running an FP16 model that spills out of VRAM when a Q4_K_M quantization would fit entirely is the most common performance mistake:

# Instead of the full-precision version
ollama pull llama3.1:70b

# Use a quantized variant that fits in VRAM
ollama pull llama3.1:70b-instruct-q4_K_M

A model fully loaded into VRAM at Q4 quantization will outperform a half-offloaded FP16 model every time. Check the benchmarks section for quantization performance comparisons across different GPU configurations.
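As a rule of thumb, weight memory is parameter count times bytes per weight. Treating Q4_K_M as roughly 4.5 effective bits per weight (an approximation — exact GGUF sizes vary by quant type), a quick back-of-envelope check before pulling:

```shell
# Rough weight-memory floor: params x bits per weight / 8
# (ignores KV cache and runtime overhead, so real usage is higher)
estimate_gb() {  # usage: estimate_gb <billions_of_params> <bits_per_weight>
  python3 -c "print(round($1e9 * $2 / 8 / 1024**3, 1))"
}

estimate_gb 70 16    # FP16:   ~130.4 GiB -- will not fit on one consumer card
estimate_gb 70 4.5   # Q4_K_M: ~36.7 GiB -- fits on a 48 GB card
```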

Tune Context Length

Large context windows consume significant VRAM. Ollama defaults vary by model, but you can override them:

# Set a shorter context window to free VRAM for layers
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain GPU hosting",
  "options": {
    "num_ctx": 4096
  }
}'

Reducing context from 32768 to 4096 can free several gigabytes of VRAM, allowing all model layers onto the GPU. Only increase context length when your workload genuinely requires it.
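The VRAM the context window consumes is mostly KV cache, roughly 2 x layers x context x KV heads x head dimension x bytes per element. Plugging in Llama 3.1 8B's architecture (32 layers, 8 KV heads, head dimension 128) with an FP16 cache gives a back-of-envelope figure — actual usage differs if Ollama quantizes the KV cache:

```shell
# KV cache bytes = 2 (K and V) x layers x ctx x kv_heads x head_dim x bytes/elem
# Llama 3.1 8B: 32 layers, 8 KV heads, head dim 128, FP16 cache (2 bytes)
kv_cache_gib() {  # usage: kv_cache_gib <num_ctx>
  python3 -c "print(2 * 32 * $1 * 8 * 128 * 2 / 1024**3)"
}

kv_cache_gib 32768   # -> 4.0 (GiB)
kv_cache_gib 4096    # -> 0.5 (GiB)
```

That 3.5 GiB difference is the "several gigabytes" mentioned above — often exactly the margin needed to fit the remaining layers on the GPU.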

Configure Parallel Request Handling

Ollama supports concurrent model instances, but each consumes additional VRAM. If multiple requests compete for GPU memory, performance drops for all of them:

# Limit to single concurrent request for maximum speed
OLLAMA_NUM_PARALLEL=1 ollama serve

# Or allow 2 parallel requests if VRAM permits
OLLAMA_NUM_PARALLEL=2 ollama serve

For production Ollama hosting, set this based on your GPU’s VRAM capacity and the model size. A single RTX 6000 Pro 96 GB can handle 2-3 parallel Llama 3.1 8B requests comfortably.
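To make the setting persist across restarts on a Linux install managed by systemd (the official install script creates an ollama service — adjust the unit name if yours differs), a drop-in override is the usual route:

```shell
# Open a drop-in override for the Ollama unit
sudo systemctl edit ollama

# Add these lines in the editor, then save:
# [Service]
# Environment="OLLAMA_NUM_PARALLEL=2"

# Apply the change
sudo systemctl restart ollama
```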

Enable Flash Attention

Newer Ollama versions support flash attention, which reduces memory usage and speeds up generation:

# Enable flash attention via environment variable
OLLAMA_FLASH_ATTENTION=1 ollama serve

Flash attention is particularly effective at longer context lengths and can improve throughput by 20-40% on supported architectures. Ensure your CUDA installation is up to date for compatibility.

Benchmark and Verify

# Run a generation and capture the timing stats (stream off for a single JSON reply)
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Write a paragraph about cloud computing",
  "stream": false
}' | grep -o '"eval_count":[0-9]*\|"eval_duration":[0-9]*'

# Monitor GPU during generation
watch -n 0.5 nvidia-smi

Target eval rates: 8B models should hit 40-80 tok/s on an RTX 5090, 15-30 tok/s on an RTX 3090. If your numbers are far below these baselines, revisit layer offloading and VRAM allocation. For higher-throughput production deployments, vLLM offers continuous batching and PagedAttention. See the vLLM production setup guide and browse the tutorials for more optimization techniques. For PyTorch-based workloads, our PyTorch GPU installation guide covers the foundational setup.
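Ollama's final /api/generate response reports eval_count (tokens generated) and eval_duration (nanoseconds); divide to get tokens per second. A small sketch against sample values — pipe the real response through the same one-liner:

```shell
# Sample final-chunk stats; in practice substitute the curl output
final='{"eval_count":240,"eval_duration":4000000000}'

echo "$final" | python3 -c '
import json, sys
r = json.load(sys.stdin)
print("%.1f tok/s" % (r["eval_count"] / r["eval_duration"] * 1e9))
'
# -> 60.0 tok/s
```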

Fast GPU Servers for Ollama

GigaGPU dedicated servers with RTX 5090 and RTX 6000 Pro GPUs — full VRAM for maximum inference speed.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
