If your prompts share prefixes – a system prompt, a few-shot template, a retrieved context that multiple users query – vLLM’s prefix caching can cut prefill time by 60-90%. On dedicated GPU servers the wall-clock improvement is often the difference between a chat feeling instant and feeling sluggish.
What It Does
When a prompt begins with tokens vLLM has already processed, the engine reuses the stored KV cache for those tokens instead of recomputing. Prefill work drops proportionally. For a 4000-token system prompt shared across many users, each query only pays prefill cost for the user’s 200-token addendum.
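The matching is easiest to picture at the level of fixed-size token blocks. Below is a minimal sketch of the idea, not vLLM's actual implementation: each full block is hashed together with everything before it, so a hash identifies an exact prefix, and a lookup walks forward until the first miss.

```python
from hashlib import sha256

BLOCK_SIZE = 16  # illustrative block size; vLLM caches KV state in token blocks

def block_hashes(tokens):
    """Hash each full block together with the blocks before it,
    so a hash identifies an exact token prefix."""
    hashes, prefix = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        prefix += str(tokens[i:i + BLOCK_SIZE]).encode()
        hashes.append(sha256(prefix).hexdigest())
    return hashes

def cached_prefix_len(cache, tokens):
    """Count how many leading tokens already have KV blocks cached."""
    hits = 0
    for h in block_hashes(tokens):
        if h not in cache:
            break
        hits += 1
    return hits * BLOCK_SIZE

cache = set()
system = list(range(4000))          # shared 4000-token system prompt
cache.update(block_hashes(system))  # the first request paid full prefill

query = system + list(range(9000, 9200))  # plus a 200-token user addendum
print(cached_prefix_len(cache, query))    # 4000 – only the addendum is recomputed
```

The prefix-aware hashing is what makes reuse safe: a block is only a hit if every token before it also matched.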
Enabling It
```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92
```
The cache lives in GPU KV memory, so prefix caching consumes some of the memory you would otherwise spend on concurrency. The tradeoff is usually favourable.
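You can estimate that cost with back-of-the-envelope arithmetic. The sketch below uses Llama 3.1 8B's published dimensions (32 layers, 8 KV heads via grouped-query attention, head dimension 128) at fp16/bf16 precision; swap in your own model's config.

```python
# Rough KV-cache cost of keeping a prefix resident in GPU memory.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128  # Llama 3.1 8B config
BYTES_PER_VALUE = 2                      # fp16/bf16

def kv_bytes(tokens):
    # 2 tensors (K and V) per layer, per KV head, per head dimension
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * tokens

print(kv_bytes(1) // 1024)        # 128 KiB per token
print(round(kv_bytes(4000) / 2**30, 2))  # ~0.49 GiB for a 4000-token prefix
```

So a single 4000-token shared system prompt holds about half a gigabyte of KV memory, which is the source of the concurrency tradeoff described above.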
Typical Gains
| Workload | Prefix Cache Hit Rate | Prefill Speed-up |
|---|---|---|
| Chat with fixed system prompt | ~60-80% | Up to 3x |
| RAG with repeated retrievals | 30-50% | 1.5-2x |
| Few-shot with fixed examples | ~80-90% | 3-5x |
| Unique prompts | Near 0% | No gain |
| Agent tool use (shared chain) | 40-70% | 2-3x |
Limits
Cache entries are keyed by exact token prefix. A one-token difference at position 100 invalidates everything from that point. Keep shared prefixes genuinely identical – do not sprinkle dynamic data (timestamps, user IDs) into the system prompt where caching would otherwise help.
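The fix is structural: put dynamic data after the static prefix, never inside it. A hypothetical sketch (the prompt text and helper names are illustrative, not from any library):

```python
import datetime

SYSTEM = "You are a support assistant for ExampleCo."  # static, cacheable

def bad_prompt(user_msg):
    # Timestamp inside the system prompt: every request has a unique
    # prefix, so the cache misses from the very first tokens.
    now = datetime.datetime.now().isoformat()
    return f"[{now}] {SYSTEM}\n\nUser: {user_msg}"

def good_prompt(user_msg):
    # Static prefix first, dynamic data last: the full system prompt
    # remains a shared, cacheable prefix across all requests.
    now = datetime.datetime.now().isoformat()
    return f"{SYSTEM}\n\n[{now}] User: {user_msg}"

a = good_prompt("reset my password")
b = good_prompt("cancel my order")
print(a[:len(SYSTEM)] == b[:len(SYSTEM)])  # True – full prefix shared
```

The same ordering rule applies to RAG: place retrieved documents that repeat across queries before any per-user content.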
Cached prefixes occupy KV-cache VRAM. On a 16 GB card serving Llama 3 8B, you might give up roughly 1-2 GB of concurrency headroom; on a 96 GB 6000 Pro the cost is negligible.
Prefix Caching Tuned for Your Prompts
We help structure system prompts and RAG pipelines to maximise cache hit rates on UK dedicated hosting.
Browse GPU Servers. See also continuous batching tuning, and SGLang vs vLLM, where RadixAttention extends this idea.