What Is the KV Cache?
Every time an LLM generates a token on your dedicated GPU server, it needs to reference all previous tokens in the sequence. Without caching, the model would recompute attention over every prior token at each step, making inference quadratically expensive.
The KV (Key-Value) cache stores the key and value tensors from each attention layer for every token generated so far. This converts the repeated computation into a simple memory lookup. The trade-off is VRAM: the cache grows linearly with sequence length and batch size. For a self-hosting walkthrough, read our self-hosted LLM guide. For teams running open-source LLMs in production, understanding KV cache behaviour is essential for capacity planning.
How KV Cache Consumes VRAM
KV cache size depends on four model factors: number of layers, number of KV heads, head dimension, and bytes per element; total usage then scales with sequence length and batch size. The per-token formula is:
KV cache per token = 2 x num_layers x num_kv_heads x head_dim x bytes_per_element
# For Llama 3 8B (FP16):
# 2 x 32 layers x 8 KV heads x 128 dim x 2 bytes = 131,072 bytes = 128 KB per token
At 4,096 tokens context, that is 512 MB per request in FP16. With a batch of 8 concurrent requests, KV cache alone consumes 4 GB — a significant chunk of an RTX 3090’s 24 GB. This is often the hidden reason teams run out of VRAM despite their model being well within the GPU’s capacity.
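The formula and the arithmetic above can be wrapped in a small helper. This is a minimal Python sketch; the parameter values are the Llama 3 8B numbers already used in this section:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Per-token KV cache: a K and a V tensor (factor of 2) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

def kv_cache_total_gb(per_token_bytes, seq_len, batch_size):
    """Total KV cache for a batch of requests at a given context length."""
    return per_token_bytes * seq_len * batch_size / 1024**3

# Llama 3 8B in FP16: 32 layers, 8 KV heads (GQA), head_dim 128
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token // 1024)                       # 128 (KB per token)
print(kv_cache_total_gb(per_token, 4096, 8))   # 4.0 (GB for batch 8 at 4K context)
```

Swapping in another model's layer count and KV-head count reproduces the table below.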
VRAM Calculations by Model Size
Here is how KV cache VRAM scales across common models at different batch sizes (4,096 token context, FP16 KV cache).
| Model | KV per token | Batch 1 (4K ctx) | Batch 8 (4K ctx) | Batch 16 (4K ctx) |
|---|---|---|---|---|
| Llama 3 8B | 128 KB | 0.5 GB | 4 GB | 8 GB |
| Llama 2 13B | 800 KB | 3.1 GB | 25 GB | 50 GB |
| Llama 3 70B | 640 KB | 2.5 GB | 20 GB | 40 GB |
| Mixtral 8x7B | 128 KB | 0.5 GB | 4 GB | 8 GB |
Note that Llama 2 13B (there is no 13B Llama 3) has a larger per-token cache than even Llama 3 70B: the 13B model uses full multi-head attention (40 KV heads), whereas Llama 3 uses GQA (8 KV heads).
For 70B models, KV cache at batch 8 exceeds the VRAM of most single GPUs. This is why large model serving often requires multi-GPU clusters. Use our LLM cost calculator to estimate total memory needs.
KV Cache Management Techniques
Quantised KV cache. Storing KV values in INT8 or FP8 instead of FP16 halves cache memory. vLLM supports this with --kv-cache-dtype fp8. Quality impact is minimal for most models.
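The saving is easy to see by plugging different element sizes into the per-token formula. A quick sketch, again using the Llama 3 8B parameters from earlier:

```python
DTYPE_BYTES = {"fp16": 2, "fp8": 1, "int8": 1}

def kv_per_token_kb(layers, kv_heads, head_dim, dtype):
    """Per-token KV cache in KB for a given cache dtype."""
    return 2 * layers * kv_heads * head_dim * DTYPE_BYTES[dtype] / 1024

# Llama 3 8B: switching the KV cache from FP16 to FP8 halves it
print(kv_per_token_kb(32, 8, 128, "fp16"))  # 128.0
print(kv_per_token_kb(32, 8, 128, "fp8"))   # 64.0
```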
Sliding window attention. Models like Mistral use sliding window attention with a fixed window size (e.g., 4,096 tokens). The KV cache only stores tokens within the window, capping memory usage regardless of total context length.
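The cap is simply a `min` over the window. A toy sketch (the 128 KB per-token figure is the Llama-3-8B-sized FP16 example from above, used purely for illustration):

```python
def kv_tokens_cached(seq_len, sliding_window=None):
    """With sliding window attention, only the last `sliding_window`
    tokens keep K/V entries; without it, every token does."""
    if sliding_window is None:
        return seq_len
    return min(seq_len, sliding_window)

PER_TOKEN_KB = 128  # FP16, Llama-3-8B-sized model
for seq in (2_000, 4_096, 32_000):
    cached = kv_tokens_cached(seq, sliding_window=4_096)
    print(f"{seq} tokens -> {cached * PER_TOKEN_KB // 1024} MB cached")
```

Past 4,096 tokens of context, cache memory stays flat instead of growing linearly.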
Grouped Query Attention (GQA). Llama 3 uses GQA, which shares KV heads across multiple query heads. This reduces KV cache by 4x compared to standard multi-head attention, which is why Llama 3 8B only needs 8 KV heads instead of 32. See our best GPU for inference guide for model-specific hardware recommendations.
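Because per-token cache scales with the number of KV heads, GQA's saving is just the ratio of query heads to KV heads:

```python
def gqa_cache_reduction(num_query_heads, num_kv_heads):
    """KV cache shrinks by the ratio of query heads to shared KV heads."""
    return num_query_heads / num_kv_heads

# Llama 3 8B: 32 query heads share 8 KV heads
print(gqa_cache_reduction(32, 8))  # 4.0
```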
Context length limits. Setting --max-model-len in vLLM caps the maximum sequence length, directly limiting peak KV cache size. Choose a length that matches your application needs.
vLLM and PagedAttention
The traditional approach allocates a contiguous VRAM block for each request’s maximum possible KV cache. This wastes memory when requests use shorter sequences. PagedAttention, the core innovation in vLLM, manages KV cache like virtual memory pages.
# Enable PagedAttention with memory controls
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B \
--gpu-memory-utilization 0.90 \
--max-model-len 4096 \
--kv-cache-dtype fp8
PagedAttention allocates small fixed-size blocks and maps them to requests dynamically. This eliminates fragmentation and enables near-perfect VRAM utilisation. Read our vLLM memory optimisation guide for complete configuration details.
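The paging idea can be sketched in a few lines. This is purely illustrative (vLLM's real allocator is far more involved); the 16-token block size matches vLLM's default `--block-size`:

```python
class PagedKVAllocator:
    """Toy block-table allocator in the spirit of PagedAttention."""

    def __init__(self, total_blocks, block_tokens=16):
        self.free = list(range(total_blocks))  # pool of physical blocks
        self.block_tokens = block_tokens
        self.tables = {}    # request id -> list of physical block ids
        self.lengths = {}   # request id -> tokens written so far

    def append_token(self, req_id):
        n = self.lengths.get(req_id, 0)
        if n % self.block_tokens == 0:  # current block full (or first token)
            if not self.free:
                raise MemoryError("out of KV cache blocks")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Finished requests return their blocks to the pool immediately."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

alloc = PagedKVAllocator(total_blocks=1024)
for _ in range(100):                 # a 100-token request
    alloc.append_token("req-1")
print(len(alloc.tables["req-1"]))    # 7 blocks = ceil(100 / 16)
```

The key property: a 100-token request holds 7 small blocks instead of a contiguous slab sized for the maximum context, so short requests leave memory free for others.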
Practical Tuning for Production
Start by calculating your VRAM budget. On a 24 GB GPU, --gpu-memory-utilization 0.90 makes 21.6 GB usable; with a 4-bit 8B model (~4.5 GB weights), that leaves roughly 17 GB.
With 17 GB available for KV cache in FP8 (64 KB per token), each 4,096-token request needs 256 MB, so the KV budget alone covers roughly 68 concurrent requests at full context. In practice, set --max-num-seqs somewhat below that to leave headroom.
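That arithmetic can be captured in a small planner function. A sketch using the numbers above (real headroom also depends on activations and fragmentation, so treat the result as an upper bound):

```python
def max_concurrent_requests(vram_gb, gpu_mem_util, weights_gb,
                            kv_per_token_kb, ctx_len):
    """Upper bound on concurrent requests that fit in the KV cache budget."""
    usable_gb = vram_gb * gpu_mem_util - weights_gb     # after vLLM's reservation
    per_request_gb = kv_per_token_kb * ctx_len / 1024**2
    return int(usable_gb // per_request_gb)

# 24 GB GPU, 4-bit 8B weights (~4.5 GB), FP8 KV cache (64 KB/token), 4K context
print(max_concurrent_requests(24, 0.90, 4.5, 64, 4096))  # 68
```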
Monitor KV cache usage by watching VRAM consumption during peak load via nvidia-smi. If VRAM hits 95%+, reduce max batch size or context length. For monitoring setup, see our GPU monitoring guide. For batch size effects on throughput, review the batch size impact analysis.
Understanding KV cache is fundamental to running efficient LLM inference on dedicated GPU hosting. Get the memory planning right and you unlock significantly higher throughput from the same hardware.
GPU Servers Sized for Your KV Cache Needs
From 8 GB to multi-GPU configurations, GigaGPU has the VRAM you need. UK-hosted dedicated servers for production LLM inference.
Browse GPU Servers