vLLM’s --enable-prefix-caching flag is one line of config that buys you 30–50% more throughput on chat workloads. Most teams running vLLM in production don’t have it on; many don’t know it exists. This page explains why it matters.
Prefix caching reuses KV-cache state for repeated prompt prefixes (system prompts, RAG contexts, few-shot examples). For typical chatbot workloads with shared system prompts, the cache hit rate is 70–90% and aggregate throughput improves by 30–50%. It's free; you just have to enable it.
How prefix caching works
Each prompt token produces key/value tensors at every attention layer. Normally, every request re-computes the full prefix even if 90% of it is identical to previous prompts. vLLM's prefix caching instead:
- Hashes the prompt prefix block-by-block (16 tokens per block by default)
- Stores the resulting KV state in a hash-keyed pool
- On the next request, looks up matching prefix hashes and reuses the cached KV directly
- Only computes the suffix (the part that differs from cached prefixes)
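The block-hashed lookup above can be sketched in Python. This is a toy illustration of the idea, not vLLM's actual implementation: the hash function, the chaining scheme, and both helper names are assumptions made for the sketch; only the 16-token default block size comes from the text.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cache block (vLLM's default)

def block_hashes(token_ids):
    """Hash the prompt block-by-block. Each block's hash chains in
    the previous block's hash, so a block only matches when the
    entire prefix before it matches too."""
    hashes, prev = [], b""
    n_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, n_full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        h = hashlib.sha256(prev + str(block).encode()).hexdigest()
        hashes.append(h)
        prev = h.encode()
    return hashes

def split_prefix(token_ids, cache):
    """Walk the block hashes until the first miss; return
    (number of cached tokens, token suffix still to prefill)."""
    hit = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        hit += BLOCK_SIZE
    return hit, token_ids[hit:]
```

The chained hash is the important design choice: a KV block is only valid for a given token *and* everything before it, so the hash has to encode the whole prefix, not just the block's own tokens.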
For a chatbot with a 1,500-token system prompt and 50-token user input, you save ~97% of the prefill computation on cache hits.
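The arithmetic behind that figure, as a quick sanity check (counting tokens skipped, which is a first-order proxy for prefill compute):

```python
system_prompt = 1500  # tokens, identical across requests
user_input = 50       # tokens, unique per request

total_prefill = system_prompt + user_input
saved = system_prompt / total_prefill
print(f"prefill tokens skipped on a cache hit: {saved:.1%}")  # → 96.8%
```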
When it helps the most
- Chatbots with shared system prompts — biggest win. Cache hit rates 80–95%.
- RAG with stable retrieved documents — high hit rate when the same chunks come up across queries.
- Few-shot prompted classifiers — identical few-shot examples on every request.
- Multi-turn conversations — each turn extends the previous one, so prefix caching keeps turn-N prefill (time-to-first-token) close to turn-1: only the new turn's tokens are computed.
When it does not help
- Random / unique prompts — embeddings indexing, classification of unrelated short texts.
- Dynamically templated prompts where the early tokens vary (e.g., timestamps in the system prompt — bad practice anyway).
- Memory-tight deployments — the cache uses VRAM. On a 16 GB card serving a 7B FP16 model, you may need to disable it under load.
Throughput numbers
| Workload | Without prefix caching | With prefix caching | Uplift |
|---|---|---|---|
| Chatbot, 1.5K system prompt | 720 tok/s | 1,180 tok/s | +64% |
| RAG, 3K context | 480 tok/s | 720 tok/s | +50% |
| Multi-turn, turn 5 | ~480 ms TTFT | ~150 ms TTFT | -69% latency |
| Random short prompts | 950 tok/s | 950 tok/s | 0% |
Measured on an RTX 5090 serving Mistral 7B in FP8. Cache hit rates vary by traffic pattern.
Verdict
For any workload with repeated prompt prefixes (which is most production workloads), enable prefix caching. It's free throughput. The only cost is VRAM, and on 24 GB+ cards it's almost negligible.
Bottom line
Add --enable-prefix-caching to your vLLM launch line. Watch vllm:gpu_prefix_cache_hit_rate_perc in your metrics — anything above 60% is a win. Combine with speculative decoding for the biggest combined uplift.
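A minimal launch line might look like this (the model name is a placeholder; substitute your own deployment's model and existing flags):

```shell
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --enable-prefix-caching
```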