
LLM Prompt Caching: Reduce Compute

Reduce LLM compute costs with prompt caching. Covers prefix caching in vLLM, KV cache reuse, system prompt deduplication, semantic caching, and cache invalidation strategies on GPU servers.

The Same System Prompt Burns GPU Cycles Every Request

Every request to your LLM API includes a 2000-token system prompt. At 1000 requests per hour, your GPU processes 2 million redundant system prompt tokens: the same computation, repeated identically each time. On a dedicated GPU server, those wasted cycles could instead serve additional users or reduce latency. Prompt caching eliminates this redundancy by reusing computed attention states across requests that share common prefixes.
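The arithmetic behind that claim is worth making concrete. A back-of-envelope sketch, using the request rate and prompt size from the scenario above:

```python
# Back-of-envelope estimate of redundant prefill work from a shared system prompt.
# Figures are from the scenario above.
system_prompt_tokens = 2000
requests_per_hour = 1000

# Without caching, every request re-prefills the full system prompt.
tokens_without_cache = system_prompt_tokens * requests_per_hour

# With prefix caching, only the first (cold) request computes the prefix;
# every later request reuses its KV cache.
tokens_with_cache = system_prompt_tokens

saved = tokens_without_cache - tokens_with_cache
print(f"{saved:,} redundant prefill tokens avoided per hour")
# 1,998,000 redundant prefill tokens avoided per hour
```

In practice the saving is slightly below the headline 2 million because the cold request still pays full price, but at steady state virtually all system-prompt prefill disappears.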

vLLM Automatic Prefix Caching

vLLM has built-in prefix caching that detects repeated prompt prefixes and reuses their KV cache entries:

# Enable prefix caching when launching vLLM
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90

# How it works:
# Request 1: System prompt (2000 tokens) + User query A
#   -> Computes KV cache for all 2000 + query tokens
# Request 2: System prompt (2000 tokens) + User query B
#   -> Reuses KV cache for the shared 2000-token prefix
#   -> Only computes KV for the new user query tokens

# Impact on a typical chatbot with 1500-token system prompt:
# Without caching: TTFT ~250ms per request
# With caching:    TTFT ~80ms for subsequent requests (3x faster)

# Verify caching is working via metrics
curl http://localhost:8000/metrics | grep cache
# the prefix-cache hit-rate metric (exact name varies by vLLM version)
# should be > 0 after warm-up

Designing Prompts for Cache Efficiency

Cache hits require exact prefix matches. Structure your prompts to maximise shared prefixes:

# BAD: dynamic content at the start breaks prefix matching
messages = [
    {"role": "system", "content": f"Date: {today}. You are a helpful assistant..."},
    {"role": "user", "content": user_query}
]
# The date changes daily, invalidating the cache every day

# GOOD: static system prompt first, dynamic content at the end
messages = [
    {"role": "system", "content": "You are a helpful assistant for AcmeTech. "
     "Answer questions about products and billing. Be concise and professional."},
    {"role": "system", "content": f"Context: Date={today}, User plan={plan}"},
    {"role": "user", "content": user_query}
]
# The first system message is always identical — cached across ALL requests

# BEST: group users by shared context for maximum cache reuse
# All free-tier users share one system prompt variant
# All enterprise users share another
# This creates two cache entries instead of one per user
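One way to implement that grouping is to key the system prompt on the user's plan tier rather than embedding per-user data in it. A minimal sketch; the tier names and prompt wording are illustrative:

```python
# Build system prompts keyed by plan tier so every user on a tier shares one
# cacheable prefix. Tier names and wording here are illustrative.
BASE_PROMPT = ("You are a helpful assistant for AcmeTech. "
               "Answer questions about products and billing. Be concise and professional.")

TIER_PROMPTS = {
    "free": BASE_PROMPT + " Suggest self-service docs before escalating.",
    "enterprise": BASE_PROMPT + " Offer to open a priority support ticket when relevant.",
}

def build_messages(plan: str, user_query: str) -> list:
    # Static, tier-level prompt first so the prefix cache can match it;
    # per-user details go last, where they cannot break the shared prefix.
    return [
        {"role": "system", "content": TIER_PROMPTS[plan]},
        {"role": "user", "content": user_query},
    ]

# Two free-tier users produce identical prefixes -> one cache entry serves both.
a = build_messages("free", "How do I reset my password?")
b = build_messages("free", "Where is my invoice?")
assert a[0] == b[0]
```

With this structure the server only ever holds one cached prefix per tier, regardless of how many users are active.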

Semantic Caching for Repeated Queries

For questions asked frequently, cache the full response instead of recomputing:

import hashlib, json, redis, numpy as np
from sentence_transformers import SentenceTransformer

r = redis.Redis()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

class SemanticCache:
    """Cache LLM responses keyed by embedding similarity.

    Responses live in Redis (with a TTL), but the embedding index is
    in-memory: it is lost on restart and grows without bound. The linear
    scan in get() is O(n); swap in a vector index (e.g. FAISS) at scale.
    """

    def __init__(self, similarity_threshold=0.92):
        self.threshold = similarity_threshold
        self.embeddings = []
        self.keys = []

    def get(self, query):
        query_emb = embedder.encode(query)

        # Linear scan for a semantically similar cached query (cosine similarity)
        for i, cached_emb in enumerate(self.embeddings):
            similarity = np.dot(query_emb, cached_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(cached_emb))
            if similarity >= self.threshold:
                cached = r.get(self.keys[i])
                if cached:  # Redis entry may have expired; its embedding is then stale
                    return json.loads(cached)
        return None

    def set(self, query, response, ttl=3600):
        # Hash the exact query text for the Redis key; similarity matching
        # happens via the embedding index, not the key.
        key = f"llm_cache:{hashlib.md5(query.encode()).hexdigest()}"
        r.set(key, json.dumps(response), ex=ttl)
        self.embeddings.append(embedder.encode(query))
        self.keys.append(key)

cache = SemanticCache(similarity_threshold=0.92)

# In request handler
cached = cache.get(user_query)
if cached:
    return cached  # Skip GPU entirely
response = await call_llm(user_query)
cache.set(user_query, response)

Multi-Turn Conversation Caching

Cache intermediate conversation states to avoid reprocessing entire histories:

# Without caching: each turn reprocesses ALL previous messages
# Turn 5 processes: system + turn1 + turn2 + turn3 + turn4 + turn5
# That is O(n^2) total tokens across a conversation

# With vLLM prefix caching: automatic for continuous sessions
# Turn 5 only computes: new turn5 tokens
# Previous turns are already in the KV cache

# For disconnected sessions, implement manual cache warming:
async def warm_cache(conversation_history):
    """Send a dummy request with the conversation prefix to warm the cache."""
    await call_llm(
        messages=conversation_history,
        max_tokens=1  # Generate minimal output
    )

# When a user reconnects, warm the cache before they send a message
await warm_cache(previous_conversation_messages)

Measuring Cache Effectiveness

# Track cache metrics
import time
from prometheus_client import Counter, Histogram

cache_hits = Counter('prompt_cache_hits', 'Prefix cache hits')
cache_misses = Counter('prompt_cache_misses', 'Prefix cache misses')
ttft = Histogram('time_to_first_token', 'TTFT distribution')

# Compare TTFT with and without caching
# Cached requests should show 2-5x lower TTFT
# Overall GPU utilisation should drop for the same request volume

# vLLM metrics endpoint shows cache statistics
# curl http://localhost:8000/metrics | grep prefix_cache
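Beyond raw counters, the two numbers you actually want are the hit rate and the prefill work avoided. A small helper for turning counts into a summary; the example figures are placeholders, not measurements:

```python
# Summarise cache effectiveness from hit/miss counts. The hit/miss and
# prefix-length figures below are placeholders; substitute your own metrics.
def cache_summary(hits: int, misses: int, prefix_tokens: int) -> dict:
    total = hits + misses
    hit_rate = hits / total if total else 0.0
    return {
        "hit_rate": round(hit_rate, 3),
        # Each hit skips re-prefilling the shared prefix.
        "prefill_tokens_saved": hits * prefix_tokens,
    }

print(cache_summary(hits=900, misses=100, prefix_tokens=1500))
# {'hit_rate': 0.9, 'prefill_tokens_saved': 1350000}
```

Feed it the hit/miss counters you already export and the typical shared-prefix length for your deployment; a hit rate well below expectations usually means dynamic content is leaking into the front of your prompts.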

Prompt caching is one of the highest-impact optimisations for vLLM deployments on your GPU server. The vLLM production guide covers server flags. See the LLM hosting section for deployment architecture, benchmarks for throughput impact, and our tutorials for end-to-end setup. Ollama users benefit from similar KV cache reuse internally.
