vLLM’s --enable-prefix-caching flag is one line of config that buys you 30–50% more throughput on chat workloads. Most teams running vLLM in production don’t have it on; many don’t know it exists. This page explains why it matters.
Prefix caching reuses KV-cache state for repeated prompt prefixes (system prompts, RAG contexts, few-shot examples). For typical chatbot workloads with shared system prompts, the cache hit rate is 70–90% and aggregate throughput improves by 30–50%. It's free; you just have to enable it.
How prefix caching works
Each prompt token produces key/value tensors at every attention layer. Normally, every request re-computes the full prefix even if 90% of it is identical to previous prompts. vLLM's prefix caching instead:
- Hashes the prompt prefix block-by-block (16 tokens per block by default)
- Stores the resulting KV state in a hash-keyed pool
- On the next request, looks up matching prefix hashes and reuses the cached KV directly
- Only computes the suffix (the part that differs from cached prefixes)
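The block-hashed lookup above can be sketched in Python. This is a toy illustration of the idea, not vLLM's actual implementation: the hash function, the chaining scheme, and both helper names are assumptions made for the sketch; only the 16-token default block size comes from the text.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cache block (vLLM's default)

def block_hashes(token_ids):
    """Hash the prompt block-by-block. Each block's hash chains in
    the previous block's hash, so a block only matches when the
    entire prefix before it matches too."""
    hashes, prev = [], b""
    n_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, n_full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        h = hashlib.sha256(prev + str(block).encode()).hexdigest()
        hashes.append(h)
        prev = h.encode()
    return hashes

def split_prefix(token_ids, cache):
    """Walk the block hashes until the first miss; return
    (number of cached tokens, token suffix still to prefill)."""
    hit = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        hit += BLOCK_SIZE
    return hit, token_ids[hit:]
```

The chained hash is the important design choice: a KV block is only valid for a given token *and* everything before it, so the hash has to encode the whole prefix, not just the block's own tokens.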
For a chatbot with a 1,500-token system prompt and 50-token user input, you save ~97% of the prefill computation on cache hits.
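The arithmetic behind that figure, as a quick sanity check (counting tokens skipped, which is a first-order proxy for prefill compute):

```python
system_prompt = 1500  # tokens, identical across requests
user_input = 50       # tokens, unique per request

total_prefill = system_prompt + user_input
saved = system_prompt / total_prefill
print(f"prefill tokens skipped on a cache hit: {saved:.1%}")  # → 96.8%
```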
When it helps the most
- Chatbots with shared system prompts — biggest win. Cache hit rates 80–95%.
- RAG with stable retrieved documents — high hit rate when the same chunks come up across queries.
- Few-shot prompted classifiers — identical few-shot examples on every request.
- Multi-turn conversations — each turn extends the previous one, so prefix caching keeps turn-N prefill (time-to-first-token) close to turn-1: only the new turn's tokens are computed.
When it does not help
- Random / unique prompts — embeddings indexing, classification of unrelated short texts.
- Dynamically templated prompts where the early tokens vary (e.g., timestamps in the system prompt — bad practice anyway).
- Memory-tight deployments — the cache uses VRAM. On a 16 GB card serving a 7B FP16 model, you may need to disable it under load.
Throughput numbers
| Workload | Without prefix caching | With prefix caching | Uplift |
|---|---|---|---|
| Chatbot, 1.5K system prompt | 720 tok/s | 1,180 tok/s | +64% |
| RAG, 3K context | 480 tok/s | 720 tok/s | +50% |
| Multi-turn, turn 5 | ~480 ms TTFT | ~150 ms TTFT | -69% latency |
| Random short prompts | 950 tok/s | 950 tok/s | 0% |
Measured on an RTX 5090 serving Mistral 7B in FP8. Cache hit rates vary by traffic pattern.
Verdict
For any workload with repeated prompt prefixes (which is most production workloads), enable prefix caching. It's free throughput. The only cost is VRAM, and on 24 GB+ cards it's almost negligible.
Bottom line
Add --enable-prefix-caching to your vLLM launch line. Watch vllm:gpu_prefix_cache_hit_rate_perc in your metrics — anything above 60% is a win. Combine with speculative decoding for the biggest combined uplift.
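A minimal launch line might look like this (the model name is a placeholder; substitute your own deployment's model and existing flags):

```shell
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --enable-prefix-caching
```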