Semantic caching is one of the highest-ROI cost levers for production LLM workloads. The idea is simple: embed each query, search for similar past queries, and return the cached response when similarity is high enough. Hit rates of 20-40% are common on real-world workloads, and each hit saves roughly £0.20-12 per million tokens depending on whether you pay hosted API rates or run self-hosted.
Pipeline: embed the query → vector-search past queries → if cosine similarity > 0.95, return the cached response. Use BGE-large for embeddings and Redis or Qdrant for the cache. The similarity threshold is the critical parameter: start at 0.95 and manually review false positives. Hit rates of 20-40% are typical for chatbots and FAQ workloads; narrow domains run higher.
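To put numbers on it (illustrative figures, not a benchmark): a workload of 50M tokens/month at £5/M costs £250/month in generation; at a 30% hit rate, roughly £75/month of that is served from cache instead, far more than the cost of the embedding and vector-search calls made on every query.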
How it works
- Receive query at API gateway
- Embed query via BGE-large (~30 ms)
- Vector search recent cache entries
- If best match cosine similarity > threshold: return cached response immediately
- Else: generate via LLM, then cache (query embedding, response, timestamp)
- Periodic eviction: TTL-based (24 hours typical) + LRU on size pressure
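A minimal in-memory sketch of this loop, assuming the sentence-transformers package for BGE-large; a real deployment would back the cache with Redis or Qdrant (see Design below), and `call_llm` here is a placeholder:

```python
import time

import numpy as np
from sentence_transformers import SentenceTransformer

EMBED_MODEL = SentenceTransformer("BAAI/bge-large-en-v1.5")  # BGE-large
THRESHOLD = 0.95          # cosine-similarity cutoff; tune per workload
TTL_SECONDS = 24 * 3600   # 24 h TTL, per the eviction step above

# Each cache entry: (normalized query embedding, response, insert timestamp)
_cache: list[tuple[np.ndarray, str, float]] = []

def call_llm(query: str) -> str:
    """Placeholder for the real generation call."""
    raise NotImplementedError

def cached_generate(query: str) -> str:
    now = time.time()
    # TTL eviction on access; a real store does this in the background
    _cache[:] = [e for e in _cache if now - e[2] < TTL_SECONDS]

    vec = EMBED_MODEL.encode(query, normalize_embeddings=True)
    if _cache:
        sims = np.stack([e[0] for e in _cache]) @ vec  # normalized, so dot = cosine
        best = int(np.argmax(sims))
        if sims[best] > THRESHOLD:
            return _cache[best][1]   # hit: skip the LLM entirely

    response = call_llm(query)       # miss: generate, then cache
    _cache.append((vec, response, now))
    return response
```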
Design
- Embedding model: BGE-large or BGE-m3 for multilingual
- Storage: Redis with vector search module, or Qdrant for larger caches
- Similarity threshold: 0.95 default; tune via manual review
- TTL: 24 hours typical; longer for stable knowledge, shorter for time-sensitive
- Cache key includes: tenant_id, model_version, prompt_template_version (so the cache is correctly partitioned; see the sketch below)
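One way to realize that partitioning (names here are illustrative, not a fixed API): derive the cache namespace from the three fields, so a prompt-template or model bump naturally invalidates the old partition rather than serving stale responses.

```python
import hashlib

def cache_namespace(tenant_id: str, model_version: str, template_version: str) -> str:
    """Partition key: entries are only shared within one (tenant, model, template) triple."""
    raw = f"{tenant_id}:{model_version}:{template_version}"
    # Hash to keep the key short and safe as a Redis prefix / Qdrant collection name.
    return "semcache:" + hashlib.sha1(raw.encode()).hexdigest()[:16]

# e.g. a Redis key for one entry:
# key = cache_namespace("acme", "gpt-4o-2024-08-06", "v3") + ":" + entry_id
```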
Hit rate
Typical hit rates by workload:
- FAQ / customer support: 30-50% (high repeat questions)
- General chatbot: 20-30%
- RAG over docs: 15-25% (queries vary more)
- Code generation: 10-15% (queries highly varied)
- Data extraction: 40-60% (similar input docs cluster)
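These figures vary enough that it is worth measuring your own hit rate rather than assuming one; a minimal counter sketch (class name and wiring are illustrative), fed from the hit/miss branches of the loop above:

```python
from collections import defaultdict

class HitRateTracker:
    """Rolling hit/miss counters, keyed by workload label."""

    def __init__(self) -> None:
        self.hits: dict[str, int] = defaultdict(int)
        self.total: dict[str, int] = defaultdict(int)

    def record(self, workload: str, hit: bool) -> None:
        self.total[workload] += 1
        self.hits[workload] += int(hit)

    def rate(self, workload: str) -> float:
        return self.hits[workload] / max(self.total[workload], 1)
```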
Verdict
Semantic caching is one of the cheapest, highest-ROI features you can add to a production AI deployment. Implementation takes roughly half a day, and a 20-40% hit rate translates directly into cost savings. Combine it with vLLM's prefix caching for compounding wins.
Bottom line
Semantic cache = 20-40% cost saving. See prefix caching.