Prefix caching (also called automatic prefix caching, APC) reuses prefilled KV cache blocks across requests that share a common prefix. On the RTX 5060 Ti 16GB via our dedicated GPU hosting, this can eliminate 80-95% of prefill cost when you run a fixed system prompt across many user messages.
How It Works
vLLM hashes each prefilled KV block (default 16 tokens) by its content. When a new request arrives, vLLM walks the prefix, hashes block-by-block, and if the hash exists in cache, reuses those GPU-resident KV blocks instead of recomputing. The cache is LRU and bounded by free GPU memory.
Prefill is the expensive phase of LLM serving on a small GPU: it is compute-bound and its cost grows with prompt length. Skipping it for cached prefixes drops first-token latency from seconds to milliseconds for the cached portion.
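The lookup can be modeled as a chained content hash over 16-token blocks with LRU eviction. The sketch below is an illustration only, assuming SHA-256 hashing and a plain `OrderedDict`; vLLM's real implementation differs in the details:

```python
from collections import OrderedDict
import hashlib

BLOCK_TOKENS = 16  # vLLM's default KV block size

class PrefixCache:
    """Toy model of content-addressed KV-block reuse (not vLLM's actual code)."""

    def __init__(self, max_blocks):
        self.blocks = OrderedDict()   # block hash -> simulated KV block, LRU order
        self.max_blocks = max_blocks

    def _hash(self, prev_hash, token_block):
        # Chained hash: a block's identity depends on all tokens before it,
        # so identical blocks at different positions are never confused.
        h = hashlib.sha256()
        h.update(prev_hash.encode())
        h.update(",".join(map(str, token_block)).encode())
        return h.hexdigest()

    def lookup_prefix(self, tokens):
        """Walk full blocks of the prompt; return how many were already cached."""
        hits, prev = 0, ""
        for i in range(0, len(tokens) - len(tokens) % BLOCK_TOKENS, BLOCK_TOKENS):
            prev = self._hash(prev, tokens[i:i + BLOCK_TOKENS])
            if prev in self.blocks:
                self.blocks.move_to_end(prev)        # refresh LRU position
                hits += 1
            else:
                self.blocks[prev] = object()         # "prefill" and cache the block
                if len(self.blocks) > self.max_blocks:
                    self.blocks.popitem(last=False)  # evict least recently used
        return hits

cache = PrefixCache(max_blocks=1024)
system = list(range(64))                  # shared 64-token system prompt (4 blocks)
first = cache.lookup_prefix(system + [900, 901])        # cold: 0 hits
second = cache.lookup_prefix(system + [800, 801, 802])  # warm: 4 hits on the shared prefix
```

Note that only full blocks are reusable: the trailing partial block of a prompt is always recomputed, which is why the warm-request numbers below are small but never zero.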
Enabling Prefix Caching
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 \
--enable-prefix-caching \
--max-model-len 16384 \
--gpu-memory-utilization 0.90
That’s it – one flag. The prefix cache uses whatever VRAM is left after model weights and running-sequence KV; no extra configuration is needed for most setups.
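To see roughly how much VRAM those flags leave for the KV pool (running sequences plus prefix cache), here is back-of-the-envelope arithmetic. The weight and buffer sizes are assumptions for an 8B model at FP8, not measured values:

```python
# Rough VRAM budget for the launch flags above (RTX 5060 Ti 16GB)
total_gb = 16.0
usable = total_gb * 0.90   # --gpu-memory-utilization 0.90
weights = 8.5              # ~8B params at FP8 plus overhead (assumption)
activations = 1.0          # runtime buffers (assumption)
kv_pool = usable - weights - activations
print(f"~{kv_pool:.1f} GB shared by running sequences and the prefix cache")
```

That lands around 5 GB, consistent with the trade-off numbers discussed further down.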
Realistic Wins
Measured on Llama 3.1 8B FP8, batch size 1, RTX 5060 Ti 16GB, with a 2 kB system prompt and a 200-token user query:
| Scenario | TTFT no cache | TTFT with cache | Speed-up |
|---|---|---|---|
| First request (cold) | 280 ms | 280 ms | 1.0x |
| Second request (warm) | 280 ms | 40 ms | 7.0x |
| 8 kB system prompt, warm | 1,100 ms | 60 ms | 18x |
| Multi-turn chat, turn 5 | 420 ms | 50 ms | 8.4x |
| Full RAG context, warm | 1,800 ms | 90 ms | 20x |
Multi-turn chat is a particularly good fit because each turn appends to the previous turn’s context – the entire conversation history is cached on turn N+1.
Patterns That Benefit
- Fixed system prompt across users. Branded AI assistants, customer support bots, role-play characters. One big upfront prompt, many variations after it.
- Multi-turn conversations. Each subsequent turn reuses the KV for the entire prior conversation.
- RAG with static contexts. If retrieved passages repeat across queries (common in documentation Q&A), the shared passages stay cached.
- Few-shot prompting. Fixed in-context examples at the start of every prompt are cached once, hit always after.
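For the multi-turn pattern, the cached fraction of the prompt can be worked out directly. The token counts below are hypothetical, chosen only to illustrate the shape of the curve:

```python
# Hypothetical token counts for a multi-turn chat (assumptions, not measurements)
system, turn = 500, 150   # tokens: system prompt, per-turn user+assistant exchange

def cached_fraction(n):
    """At the start of turn n, everything up through turn n-1 is already cached."""
    prompt = system + n * turn          # full prompt length at turn n
    cached = system + (n - 1) * turn    # prefix reused from earlier turns
    return cached / prompt

# By turn 5, 88% of the prompt's prefill is skipped – and the fraction
# keeps climbing as the conversation grows.
```

This is why the speed-up in the table above improves with turn number: the uncached tail stays a constant ~1 turn while the cached prefix grows without bound.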
Trade-offs
- Cache memory competes with running-sequence KV. On 16 GB with ~5 GB free after weights, each 16-token block of Llama 8B KV is roughly 1–2 MB (FP8 vs. FP16 KV cache), so the pool holds a few thousand blocks – plenty for most system prompts.
- Cache is LRU; very high prompt diversity means low hit rate.
- No cross-session persistence by default – restart loses the cache. For multi-hour sessions this is fine; for 24/7 production, warm it proactively on boot.
- No downside if hit rate is 0 – vLLM just computes normally. Enable by default on any chat-style workload.
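Warming the cache on boot is just a matter of sending one throwaway request that prefills the system prompt. A minimal stdlib sketch, assuming the server from the launch command above is on `base_url`; the endpoint path and model name should match your own deployment:

```python
import json
import urllib.request

def warm_payload(system_prompt):
    """A 1-token chat completion whose only purpose is to prefill
    (and therefore cache) the system prompt's KV blocks."""
    return json.dumps({
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "system", "content": system_prompt}],
        "max_tokens": 1,
    }).encode()

def warm_cache(base_url, system_prompt):
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=warm_payload(system_prompt),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()  # response is discarded; the KV blocks stay cached

# e.g. warm_cache("http://localhost:8000", open("system_prompt.txt").read())
```

Run this from your startup script (or a Kubernetes post-start hook) so the first real user never pays the cold-prefill cost.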
vLLM 0.6+ made APC effectively free: there is no measurable overhead on the cache-miss path. Recommendation: enable it on every vLLM deployment, and especially on chat-style workloads like the FP8 Llama config above.
Prefix-Cache-Enabled LLM Hosting
Turn multi-turn chat TTFT from 400 ms to 50 ms. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: chunked prefill, speculative decoding, FP8 KV cache, context budget, RAG pipeline.