
vLLM Prefix Caching: How It Works and Why It’s Free Throughput

Prefix caching is the single highest-leverage tuning flag in vLLM. Here is how it works, when it helps, and the throughput uplift you should expect.

vLLM’s --enable-prefix-caching flag is one line of config that buys you 30–50% more throughput on chat workloads. Most teams running vLLM in production don’t have it on; many don’t know it exists. This page explains why it matters.

TL;DR

Prefix caching reuses KV-cache state for repeated prompt prefixes (system prompts, RAG contexts, few-shot examples). For typical chatbot workloads with shared system prompts, the cache hit rate is 70–90% and aggregate throughput improves by 30–50%. It's effectively free; just enable it.

How prefix caching works

Every token in an LLM prompt produces key/value tensors in each attention layer; together they make up the prompt's KV-cache state. Normally, every request recomputes that state for the full prefix, even if 90% of it is identical to previous prompts. vLLM's prefix caching:

  1. Hashes the prompt prefix block-by-block (16 tokens per block by default)
  2. Stores the resulting KV state in a hash-keyed pool
  3. On the next request, looks up matching prefix hashes and reuses the cached KV directly
  4. Only computes the suffix (the part that differs from cached prefixes)
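
To make the hashing step concrete, here is a conceptual sketch (not vLLM's internal code) of chained block hashing with the default 16-token blocks: each block's hash folds in its parent's hash, so a block is only reusable when the entire prefix before it also matches.

  import hashlib

  BLOCK_SIZE = 16  # tokens per block, vLLM's default

  def block_hashes(token_ids: list[int]) -> list[str]:
      # Only full blocks are hashed; a partial trailing block is not cached.
      full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
      hashes, parent = [], ""
      for i in range(0, full_len, BLOCK_SIZE):
          block = token_ids[i:i + BLOCK_SIZE]
          # Chain the parent hash so reuse requires the whole preceding prefix to match.
          parent = hashlib.sha256((parent + ",".join(map(str, block))).encode()).hexdigest()
          hashes.append(parent)
      return hashes

Two prompts that share their first 1,504 tokens produce identical hashes for the first 94 blocks, so the KV state for those blocks can be served straight from the pool.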

For a chatbot with a 1,500-token system prompt and 50-token user input, you save ~97% of the prefill computation on cache hits.
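
Turning it on is one flag on the server (--enable-prefix-caching) or one argument in the offline Python API. A minimal sketch using the LLM class; the model name and prompt strings are placeholders:

  from vllm import LLM, SamplingParams

  # enable_prefix_caching=True is the offline equivalent of --enable-prefix-caching
  llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3", enable_prefix_caching=True)

  # Stand-in for a long, byte-identical system prompt shared by every request
  SYSTEM_PROMPT = "You are a helpful assistant for Acme. Follow the policy below.\n" * 100

  params = SamplingParams(max_tokens=64)

  # Both prompts share the same long prefix; the second request reuses the
  # cached KV blocks and only prefills the short user suffix.
  outputs = llm.generate(
      [SYSTEM_PROMPT + "User: What is prefix caching?",
       SYSTEM_PROMPT + "User: How much VRAM does it cost?"],
      params,
  )
  for out in outputs:
      print(out.outputs[0].text)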

When it helps the most

  • Chatbots with shared system prompts — biggest win. Cache hit rates 80–95%.
  • RAG with stable retrieved documents — high hit rate when the same chunks come up across queries.
  • Few-shot prompted classifiers — identical few-shot examples on every request.
  • Multi-turn conversations — each turn extends the previous one, so prefix caching keeps turn-N latency roughly equal to turn-1 (see the sketch below).
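
Here is a rough sketch of that multi-turn pattern against a local vLLM OpenAI-compatible server (the URL, API key, and model name are placeholders, and the server is assumed to have been launched with --enable-prefix-caching):

  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
  messages = [{"role": "system", "content": "You are a support agent for Acme."}]

  for user_msg in ["Hi, my order is late.", "It's order #1234.", "Thanks!"]:
      messages.append({"role": "user", "content": user_msg})
      # Every turn resends the full history, but everything before the new
      # user message is a prefix-cache hit, so only the new tokens are prefilled.
      resp = client.chat.completions.create(
          model="mistralai/Mistral-7B-Instruct-v0.3",
          messages=messages,
      )
      messages.append({"role": "assistant", "content": resp.choices[0].message.content})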

When it does not help

  • Random / unique prompts — embeddings indexing, classification of unrelated short texts.
  • Dynamically templated prompts where the early tokens vary (e.g., a timestamp at the top of the system prompt, which is bad practice anyway; see the sketch after this list).
  • Memory-tight deployments — the cache uses VRAM. On a 16 GB card serving a 7B FP16 model, you may need to disable it under load.
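
The templating problem is easy to avoid: keep the long static text first and push dynamic fields towards the end of the prompt. An illustrative sketch (the prompt strings are made up):

  from datetime import datetime, timezone

  STATIC_SYSTEM = "You are a helpful assistant for Acme. Follow the policy below.\n" * 50

  def bad_prompt(user_msg: str) -> str:
      # Timestamp first: the very first block differs on every request,
      # so no later block can ever be a cache hit.
      now = datetime.now(timezone.utc).isoformat()
      return f"Current time: {now}\n{STATIC_SYSTEM}User: {user_msg}"

  def good_prompt(user_msg: str) -> str:
      # Static text first, dynamic fields last: the long prefix stays
      # byte-identical across requests and its KV blocks stay reusable.
      now = datetime.now(timezone.utc).isoformat()
      return f"{STATIC_SYSTEM}Current time: {now}\nUser: {user_msg}"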

Throughput numbers

Workload                    | Without prefix caching | With prefix caching | Uplift
Chatbot, 1.5K system prompt | 720 tok/s              | 1,180 tok/s         | +64%
RAG, 3K context             | 480 tok/s              | 720 tok/s           | +50%
Multi-turn, turn 5          | ~480 ms TTFT           | ~150 ms TTFT        | -69% latency
Random short prompts        | 950 tok/s              | 950 tok/s           | 0%

Measured on an RTX 5090 running Mistral 7B in FP8; cache hit rates vary by traffic pattern.

Verdict

For any workload with repeated prompt prefixes (which is most production workloads), enable prefix caching. It's free throughput. The only cost is VRAM, and on cards with 24 GB or more it is negligible.

Bottom line

Add --enable-prefix-caching to your vLLM launch line, then watch vllm:gpu_prefix_cache_hit_rate_perc in your metrics; anything above 60% is a win. Combine it with speculative decoding for an even bigger uplift.
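
For a quick sanity check without a dashboard, you can scrape the server's Prometheus endpoint directly (the URL is a placeholder, and exact metric names vary between vLLM versions, so filter loosely):

  import requests

  metrics = requests.get("http://localhost:8000/metrics", timeout=5).text
  for line in metrics.splitlines():
      # Print every prefix-cache-related counter and gauge the server exposes.
      if "prefix_cache" in line and not line.startswith("#"):
          print(line)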
