## Overview: What Each Tool Optimises For
vLLM and Ollama are frequently mentioned in the same breath, but they were built to solve different problems and the wrong choice will cost you either developer hours or production headroom. vLLM is a high-throughput inference server engineered around PagedAttention and continuous batching, designed to keep a GPU saturated while serving many simultaneous requests through an OpenAI-compatible API. It assumes you have an Ampere-class or newer NVIDIA GPU, that the model fits in VRAM, and that you are happy living inside a Python and CUDA stack. Ollama, by contrast, optimises for developer convenience: a single static binary, GGUF model files pulled by a one-line command, automatic CPU fallback, and a friendly REPL. It is built on top of llama.cpp and inherits its strengths (broad hardware support, tight quantisation) and its weakness (effectively single-stream throughput).
Put plainly: vLLM is what you reach for when you need to serve a chatbot, agent backend, or RAG system to dozens or hundreds of concurrent users; Ollama is what you reach for when you want a model running on your laptop or a small internal tool in under five minutes. The throughput delta on the same hardware is roughly an order of magnitude under multi-user load, but Ollama wins the wall-clock race from “fresh box” to “first token” by a similar margin. This guide settles the choice with concrete numbers, real setup recipes, and a decision matrix you can apply to your own workload. If you are still choosing hardware, our best GPU for LLM inference piece is the right precursor.
## Architecture Differences
The architectural gap between the two systems is the root cause of every benchmark difference downstream. vLLM is a PyTorch and CUDA application. It loads model weights into GPU memory in their native format (typically FP16, BF16, or quantised AWQ/GPTQ/FP8), then services requests through a scheduler that uses PagedAttention — a KV cache organised into fixed-size blocks the way an OS organises virtual memory. The scheduler can interleave prefill and decode phases of many requests in the same forward pass, a technique called continuous batching, and it can share KV blocks across requests with identical prefixes (prefix caching). On a multi-GPU box vLLM also supports tensor parallelism, splitting each layer across two, four, or eight GPUs to fit larger models or boost throughput further.
Ollama wraps llama.cpp, a C++ project that uses the GGML/GGUF tensor library. GGUF is a single-file model format that bundles weights, tokeniser, and metadata, with aggressive quantisation options (Q4_K_M, Q5_K_M, Q8_0) that produce smaller files than the AWQ or GPTQ formats vLLM consumes. The runtime is built to gracefully degrade: if a layer does not fit in VRAM, it lives in system RAM and is computed on the CPU. There is no PagedAttention; KV cache is contiguous per request, which is simple and fast for one stream but wasteful when many requests are in flight. The default scheduler serialises requests once a small concurrency threshold is exceeded, so throughput effectively caps at single-stream speed for any meaningful batch.
| Dimension | vLLM | Ollama |
|---|---|---|
| Backend | PyTorch + custom CUDA kernels | llama.cpp / GGML |
| Model format | HF safetensors, AWQ, GPTQ, FP8 | GGUF (Q4_K_M, Q5_K_M, Q8_0, FP16) |
| KV cache | Paged, block-based, shared on prefix | Contiguous per-request |
| Batching | Continuous, interleaved prefill/decode | Effectively single-stream |
| Multi-GPU | Tensor and pipeline parallel | Layer-split across GPUs only |
| CPU fallback | No | Yes (graceful) |
| API | OpenAI-compatible, native | Custom + OpenAI-compatible shim |
## Throughput Comparison: Real Numbers
Numbers below are measured on UK GigaGPU stock with the same prompt set (a mix of short chat turns averaging 180 input tokens and 220 output tokens) and the same model weights where format permits. vLLM is run with continuous batching at the stated concurrency; Ollama is run with its default scheduler and the same number of concurrent client connections. Aggregate throughput is the sum across all in-flight requests; single-stream latency is the time to last token for a lone request with a warm cache. For colour on hardware sizing, see Llama 3 VRAM requirements and the RTX 4090 spec breakdown.
| Model | GPU | Engine | Concurrency | Aggregate tok/s | Per-stream tok/s |
|---|---|---|---|---|---|
| Llama 3.1 8B FP16 | RTX 4090 24GB | vLLM | 32 | 1,820 | 57 |
| Llama 3.1 8B FP16 | RTX 4090 24GB | vLLM | 1 | 105 | 105 |
| Llama 3.1 8B Q4_K_M | RTX 4090 24GB | Ollama | 32 | 112 | 3.5 |
| Llama 3.1 8B Q4_K_M | RTX 4090 24GB | Ollama | 1 | 96 | 96 |
| Llama 3.1 8B FP16 | RTX 5060 Ti 16GB | vLLM | 16 | 720 | 45 |
| Llama 3.1 8B Q4_K_M | RTX 5060 Ti 16GB | Ollama | 1 | 78 | 78 |
| Llama 3.1 8B FP16 | RTX 3090 24GB | vLLM | 32 | 1,310 | 41 |
| Llama 3.1 8B Q4_K_M | RTX 3090 24GB | Ollama | 1 | 82 | 82 |
| Llama 3.1 70B AWQ | RTX 4090 24GB | vLLM | 8 | 185 | 23 |
| Llama 3.1 70B Q4_K_M | RTX 4090 24GB | Ollama | 1 | 14 | 14 |
Read the table as a story about utilisation. At concurrency 1 the gap is small — Ollama on Q4 lands within ten percent of vLLM on the 4090 (96 versus 105 tokens per second), its smaller quantised weights and lighter per-token compute keeping it close. The moment you have more than one user, the curves diverge violently: vLLM scales aggregate throughput nearly linearly with batch size up to the GPU’s compute ceiling, while Ollama’s aggregate barely improves and per-stream collapses. The 70B row is the most telling: a single 4090 running Llama 70B AWQ under vLLM serves eight users at 23 tokens per second each — perfectly usable for chat — whereas Ollama on the same card delivers 14 tokens per second to one user only. Concurrent-user behaviour is explored further in our concurrent users analysis.
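If you want a crude sanity check of the concurrency story on your own card — an approximation, not the harness behind the table, and assuming a vLLM server already listening on its default port 8000 (the compose recipe later in this guide) — a shell loop is enough:

```bash
# Fire 32 concurrent requests at a local vLLM and time the whole batch.
# If responses run to max_tokens, aggregate tok/s is roughly
# (32 * max_tokens) / wall-clock seconds.
N=32
BODY='{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "messages": [{"role": "user", "content": "Describe GPUs in two sentences."}],
  "max_tokens": 220}'
time (
  for _ in $(seq "$N"); do
    curl -s http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" -d "$BODY" > /dev/null &
  done
  wait
)
```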
## Resource Utilisation and VRAM Behaviour
The two engines treat the GPU like fundamentally different resources. vLLM allocates its KV cache pool at startup based on the --gpu-memory-utilization flag (default 0.9), pinning a large slab of VRAM for the duration of the process. This makes capacity planning predictable: if vLLM starts, it will not OOM later under load, because every potential KV block is already accounted for. The trade-off is that the process appears to “waste” memory when idle — nvidia-smi will show 22 GB used on a 24 GB card even at 3 a.m. with no traffic.
Ollama loads models on demand and unloads them after a configurable idle timeout (default 5 minutes). VRAM usage is bursty and proportional to the number of distinct models recently used. For a single model this is efficient, but if you switch between three models every few minutes you will pay the load cost (5 to 30 seconds) repeatedly. GPU utilisation under sustained load is also very different: vLLM holds the SMs at 90 to 98 percent during decode, with brief dips during prefill rebalancing; Ollama oscillates between 100 percent during a single request and 0 percent between requests, never overlapping work. For watching either of these in practice, our guide on monitoring GPU usage is the practical companion.
| Metric | vLLM | Ollama |
|---|---|---|
| VRAM at startup (Llama 3.1 8B, 4090) | 22.0 GB (90% pool) | 5.6 GB (Q4_K_M) |
| VRAM under load, 32 concurrent | 22.0 GB (steady) | 6.8 GB (cache grows) |
| GPU utilisation, sustained | 92 to 98% | 30 to 60% bursty |
| Cold start to first token | 120 to 300 s | 3 to 8 s |
| Idle behaviour | Holds VRAM, near-zero compute | Unloads after 5 min |
| Effective KV cache packing | High (paged blocks) | Low (contiguous per request) |
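Both utilisation patterns in the table are easy to observe live with nvidia-smi’s standard query flags — a one-line sampler is all it takes:

```bash
# Sample VRAM and SM utilisation once per second. Expect a flat ~22 GB and
# 90%+ utilisation under vLLM load; a smaller, growing footprint and bursty
# utilisation under Ollama.
nvidia-smi --query-gpu=timestamp,memory.used,utilization.gpu --format=csv -l 1
```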
## Setup Recipes Side by Side
The setup gap is where Ollama’s appeal is undeniable. A working Ollama install on a fresh Ubuntu 22.04 box is a single line; a working vLLM install is a Docker compose file plus a Hugging Face token plus model selection. Both recipes below are validated on UK GigaGPU stock with a recent NVIDIA driver already in place — for the driver and CUDA prep, see our PyTorch GPU server install walk-through and the full RTX 4090 vLLM setup guide.
Ollama in three lines:
```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama run llama3.1:8b
```
That is it. The first run will pull a roughly 4.7 GB GGUF file, drop you into an interactive prompt, and expose an HTTP API on localhost:11434 with both the native /api/generate endpoint and an OpenAI-compatible /v1/chat/completions shim. To make it listen on all interfaces for an internal service, set OLLAMA_HOST=0.0.0.0:11434 before starting the daemon and put a reverse proxy in front of it.
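A quick smoke test of the OpenAI-compatible shim, assuming the default port and the llama3.1:8b tag pulled above:

```bash
# Ollama's /v1 shim accepts the standard chat-completions request shape.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b",
       "messages": [{"role": "user", "content": "Say hello in five words."}]}'
```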
vLLM via docker-compose:
```yaml
version: "3.9"
services:
  vllm:
    image: vllm/vllm-openai:v0.6.3
    container_name: vllm
    runtime: nvidia
    restart: unless-stopped
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
      - HF_HUB_ENABLE_HF_TRANSFER=1
    ports:
      - "8000:8000"
    volumes:
      - ./hf-cache:/root/.cache/huggingface
    ipc: host
    command: >
      --model meta-llama/Meta-Llama-3.1-8B-Instruct
      --tensor-parallel-size 1
      --max-model-len 16384
      --gpu-memory-utilization 0.90
      --enable-prefix-caching
      --enable-chunked-prefill
      --kv-cache-dtype fp8
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
The flags matter and each one earns its place. --max-model-len 16384 caps context at 16K to keep KV cache bounded; raise it only if your workload needs it. --gpu-memory-utilization 0.90 reserves 90 percent of VRAM for weights plus KV; the remaining 10 percent absorbs CUDA workspace and headroom. --enable-prefix-caching shares KV blocks across requests with identical system prompts — a significant win for chatbots and RAG. --enable-chunked-prefill interleaves long prefills with ongoing decodes so a single user with a 12K-token prompt does not stall everyone else. --kv-cache-dtype fp8 halves KV memory on Ada and Hopper, doubling effective batch capacity. Ada-specific tuning lives in our FP8 Llama deployment and AWQ quantisation guides.
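Once the container is up, the same request shape from the Ollama smoke test works against vLLM — only the port and the model name change, which is exactly what makes the hybrid pattern later in this guide work:

```bash
# Same OpenAI-compatible endpoint; the model name must match the --model flag.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "Say hello in five words."}]}'
```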
## When to Use Which: Decision Matrix
The decision is rarely “best engine in the abstract” — it is “best engine for this workload on this budget.” The matrix below collapses the most common scenarios into a single recommendation. “Either” means both will serve adequately and your choice should hinge on the team’s familiarity rather than throughput.
| Use case | Recommendation | Why |
|---|---|---|
| Developer workstation, exploring models | Ollama | Five-second model switches, no infra to manage |
| Single-user personal chat assistant | Ollama | Single-stream tok/s is competitive, simpler stack |
| Internal team tool, 5 to 20 occasional users | Either | Ollama if traffic is sparse; vLLM if bursts are concurrent |
| Public-facing API, sustained concurrent traffic | vLLM | 10x aggregate throughput, predictable VRAM |
| Multi-tenant SaaS with shared prompts | vLLM | Prefix caching and continuous batching are essential |
| Overnight batch generation jobs | vLLM | Saturates the GPU, finishes faster, costs less per token |
| Agent backend with parallel tool calls | vLLM | Multiple in-flight calls share the same model efficiently |
| Edge device or laptop with no NVIDIA GPU | Ollama | Only option with CPU and Apple Silicon support |
| Quick proof-of-concept demo for a stakeholder | Ollama | Up and running before the calendar invite ends |
If you are doing the cost arithmetic on whether to self-host at all, our cost per million tokens analysis and the self-hosting break-even piece are the right next reads — the throughput numbers above feed directly into those models (at the 4090’s 1,820 aggregate tokens per second, a saturated card produces roughly 6.5 million output tokens per hour), and the conclusion almost always favours vLLM the moment your traffic is steady.
## Common Gotchas and Pitfalls
Both engines have sharp edges. The most common production incidents we see on customer servers are not bugs in the software but mismatches between what an operator expected and how the engine actually behaves under load.
| Engine | Gotcha | Mitigation |
|---|---|---|
| vLLM | Cold start of 2 to 5 minutes for large models | systemd unit with restart-on-failure; never SIGKILL |
| vLLM | No GGUF support — must use HF safetensors, AWQ, GPTQ or FP8 | Convert or pull pre-quantised AWQ from Hugging Face |
| vLLM | KV cache pool eats almost all VRAM at startup | Lower --gpu-memory-utilization if running other CUDA processes |
| vLLM | Version sensitivity — flags change between minor releases | Pin the image tag, read release notes before upgrading |
| vLLM | Long prompts can stall others without chunked prefill | Always enable --enable-chunked-prefill in production |
| Ollama | Concurrent requests serialise once the queue fills | Use OLLAMA_NUM_PARALLEL but expect throughput ceiling |
| Ollama | AMD ROCm support is patchy across cards | Stick to NVIDIA or accept CPU fallback |
| Ollama | No PagedAttention — VRAM less efficient at scale | Do not try to push past 4 to 8 concurrent users on one model |
| Ollama | Model unloads after idle timeout, costing 5 to 30 s on next request | Set OLLAMA_KEEP_ALIVE=-1 for always-resident |
| Ollama | OpenAI shim does not implement every parameter | Test client libraries against the shim before committing |
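The two Ollama mitigations in the table are environment variables on the server process. One way to make them stick under the systemd service the install script creates — a sketch, with the parallelism value sized to your own VRAM:

```bash
# Add the env vars from the table to the ollama systemd unit via a drop-in.
sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=-1"    # keep the model resident, never unload
#   Environment="OLLAMA_NUM_PARALLEL=4"   # allow four in-flight requests
sudo systemctl restart ollama
```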
## The Hybrid Pattern: Ollama Dev, vLLM Prod
The pattern that keeps both teams happy is to run Ollama on developer laptops and dev servers and vLLM on production infrastructure, exposing both behind the same OpenAI-compatible interface so application code is identical. Use the same model name on both sides — for example Llama 3.1 8B Instruct — even though the underlying file format differs (GGUF on Ollama, AWQ or FP8 safetensors on vLLM). Build a thin client wrapper that points at http://localhost:11434/v1 in development and https://llm.internal.example.com/v1 in production, and you can ship code that works on a developer’s MacBook and on a multi-GPU server without modification.
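A minimal sketch of that thin wrapper in shell — the LLM_BASE_URL and LLM_MODEL variable names are illustrative conventions, and the hostnames match the examples above:

```bash
#!/usr/bin/env bash
# llm-chat.sh — one request shape, two engines. Hits Ollama by default;
# export LLM_BASE_URL=https://llm.internal.example.com/v1 in production.
BASE_URL="${LLM_BASE_URL:-http://localhost:11434/v1}"
MODEL="${LLM_MODEL:-llama3.1:8b}"   # prod would set the vLLM --model name here

# Quoting is naive for brevity; real code should JSON-encode the prompt.
curl -s "$BASE_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL\",
       \"messages\": [{\"role\": \"user\", \"content\": \"$1\"}]}"
```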
The only real constraint of this pattern is sampling parity. Different quantisation levels and different KV cache dtypes do shift output distributions slightly, so do not write deterministic tests that depend on exact token sequences across the two engines. Use evaluation harnesses that compare semantic quality rather than string equality. For a deeper walk-through of self-hosting both ends of this pipeline, see the self-host LLM guide and our dedicated GPU hosting overview.
## Verdict
Use Ollama for development, exploration, and any single-user or sparsely-trafficked internal tool — its setup speed and convenience are unmatched and the throughput is genuinely sufficient for one person at a time. Use vLLM the moment you are serving concurrent users, building an API, or running batch jobs where throughput maps directly to cost; the order-of-magnitude aggregate gain pays for the extra setup work within the first day of real traffic. The two are not rivals so much as adjacent tools, and the most mature teams run both: Ollama on the laptop, vLLM on the GPU server.
Ready to put vLLM into production on UK-hosted hardware? Spin up a dedicated GPU in minutes from the GigaGPU portal store — RTX 4090 24GB and RTX 5060 Ti 16GB stock is on tap, with full root access, no per-token billing, and the same network used to validate every benchmark in this guide.