
KV Cache Explained: Why It Eats Your VRAM

Understanding how the KV cache works in LLM inference, why it consumes so much VRAM, and practical techniques to manage it on dedicated GPU servers.

What Is the KV Cache?

Every time an LLM generates a token on your dedicated GPU server, it needs to reference all previous tokens in the sequence. Without caching, the model would recompute attention over every prior token at each step, making inference quadratically expensive.

The KV (Key-Value) cache stores the key and value tensors from each attention layer for every token generated so far. This converts the repeated computation into a simple memory lookup. The trade-off is VRAM: the cache grows linearly with sequence length and batch size. For a self-hosting walkthrough, read our self-hosted LLM guide. For teams running open-source LLMs in production, understanding KV cache behaviour is essential for capacity planning.

How KV Cache Consumes VRAM

KV cache size depends on four factors: number of layers, number of attention heads, head dimension, and sequence length. The formula is:

KV cache per token = 2 x num_layers x num_kv_heads x head_dim x bytes_per_element

# For Llama 3 8B (FP16):
# 2 x 32 layers x 8 KV heads x 128 dim x 2 bytes = 131,072 bytes = 128 KB per token

At 4,096 tokens context, that is 512 MB per request in FP16. With a batch of 8 concurrent requests, KV cache alone consumes 4 GB — a significant chunk of an RTX 3090’s 24 GB. This is often the hidden reason teams run out of VRAM despite their model being well within the GPU’s capacity.
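The arithmetic above can be checked with a few lines of Python. This is a back-of-envelope sketch using Llama 3 8B's published configuration (32 layers, 8 KV heads, head dimension 128), not vLLM's internal accounting:

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    # Factor of 2 covers the separate key and value tensors at each layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_request = per_token * 4096      # one request at 4,096-token context
batch_total = per_request * 8       # eight concurrent requests

print(per_token // 1024, "KB per token")        # 128 KB
print(per_request // 2**20, "MB per request")   # 512 MB
print(batch_total // 2**30, "GB for batch of 8")  # 4 GB
```

Plugging in a different model's layer count, KV head count, and head dimension gives its per-token figure directly.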

VRAM Calculations by Model Size

Here is how KV cache VRAM scales across common models at different batch sizes (4,096 token context, FP16 KV cache).

Model         | KV per token | Batch 1 (4K ctx) | Batch 8 (4K ctx) | Batch 16 (4K ctx)
Llama 3 8B    | 128 KB       | 0.5 GB           | 4 GB             | 8 GB
Llama 3 13B   | 200 KB       | 0.8 GB           | 6.4 GB           | 12.8 GB
Llama 3 70B   | 640 KB       | 2.5 GB           | 20 GB            | 40 GB
Mixtral 8x7B  | 256 KB       | 1 GB             | 8 GB             | 16 GB

For 70B models, KV cache at batch 8 exceeds the VRAM of most single GPUs. This is why large model serving often requires multi-GPU clusters. Use our LLM cost calculator to estimate total memory needs.

KV Cache Management Techniques

Quantised KV cache. Storing KV values in INT8 or FP8 instead of FP16 halves cache memory. vLLM supports this with --kv-cache-dtype fp8. Quality impact is minimal for most models.

Sliding window attention. Models like Mistral use sliding window attention with a fixed window size (e.g., 4,096 tokens). The KV cache only stores tokens within the window, capping memory usage regardless of total context length.
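The memory cap from a sliding window is easy to see numerically. A sketch assuming a 4,096-token window and the 128 KB-per-token figure from earlier (illustrative arithmetic, not a model implementation):

```python
def cached_tokens(seq_len, window=4096):
    # With sliding window attention, the cache never holds more than
    # `window` tokens, so memory stops growing past that point.
    return min(seq_len, window)

PER_TOKEN_KB = 128  # Llama-class model, FP16 KV cache

for seq_len in (2048, 4096, 32768):
    mb = cached_tokens(seq_len) * PER_TOKEN_KB / 1024
    print(f"{seq_len:>6} tokens -> {mb:.0f} MB KV cache")
```

At 32,768 tokens of context the cache is the same 512 MB as at 4,096: the window, not the total context, sets the ceiling.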

Grouped Query Attention (GQA). Llama 3 uses GQA, which shares KV heads across multiple query heads. This reduces KV cache by 4x compared to standard multi-head attention, which is why Llama 3 8B only needs 8 KV heads instead of 32. See our best GPU for inference guide for model-specific hardware recommendations.
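The 4x saving follows directly from the per-token formula, since KV cache scales with KV heads rather than query heads. A quick check using Llama 3 8B's figures:

```python
num_layers, head_dim, bytes_fp16 = 32, 128, 2

# Standard multi-head attention: one KV head per query head (32 of each).
mha_per_token = 2 * num_layers * 32 * head_dim * bytes_fp16
# GQA: 8 KV heads shared across the 32 query heads (4 queries per KV head).
gqa_per_token = 2 * num_layers * 8 * head_dim * bytes_fp16

print(mha_per_token // 1024, "KB vs", gqa_per_token // 1024, "KB")
print(mha_per_token // gqa_per_token, "x reduction")  # 4x
```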

Context length limits. Setting --max-model-len in vLLM caps the maximum sequence length, directly limiting peak KV cache size. Choose a length that matches your application needs.
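The effect of the cap on peak memory is simple to estimate: worst-case KV cache is bounded by the maximum sequence length times the per-token size times the number of concurrent sequences. A sketch (again using the Llama 3 8B FP16 figure; this mirrors the bound, not vLLM's exact allocator):

```python
def peak_kv_gb(max_model_len, per_token_bytes, max_num_seqs):
    # Worst-case KV cache if every sequence runs to the length cap.
    return max_model_len * per_token_bytes * max_num_seqs / 2**30

PER_TOKEN = 128 * 1024  # 128 KB per token, FP16

print(peak_kv_gb(4096, PER_TOKEN, 8))  # 4.0 GB
print(peak_kv_gb(2048, PER_TOKEN, 8))  # 2.0 GB
```

Halving --max-model-len halves the worst-case cache, which is why it is the bluntest but most reliable lever.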

vLLM and PagedAttention

The traditional approach allocates a contiguous VRAM block for each request’s maximum possible KV cache. This wastes memory when requests use shorter sequences. PagedAttention, the core innovation in vLLM, manages KV cache like virtual memory pages.

# Enable PagedAttention with memory controls
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --kv-cache-dtype fp8

PagedAttention allocates small fixed-size blocks and maps them to requests dynamically. This eliminates fragmentation and enables near-perfect VRAM utilisation. Read our vLLM memory optimisation guide for complete configuration details.
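The contrast with contiguous preallocation can be sketched numerically. Assuming vLLM's default block size of 16 tokens and the 128 KB-per-token figure from earlier (an illustration of the allocation arithmetic, not vLLM's implementation):

```python
import math

BLOCK_SIZE = 16       # tokens per block (vLLM's default)
PER_TOKEN_KB = 128    # Llama 3 8B, FP16 KV cache

def paged_alloc_kb(seq_len):
    # Paged: allocate only ceil(seq_len / block_size) blocks as the
    # sequence grows; waste is at most one partial block.
    blocks = math.ceil(seq_len / BLOCK_SIZE)
    return blocks * BLOCK_SIZE * PER_TOKEN_KB

def contiguous_alloc_kb(max_model_len=4096):
    # Traditional: reserve the maximum possible sequence up front.
    return max_model_len * PER_TOKEN_KB

print(paged_alloc_kb(500) / 1024, "MB paged")         # 64.0 MB
print(contiguous_alloc_kb() / 1024, "MB contiguous")  # 512.0 MB
```

A 500-token request holds 64 MB instead of the 512 MB a contiguous scheme would reserve, which is where the near-perfect utilisation comes from.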

Practical Tuning for Production

Start by calculating your VRAM budget. On a 24 GB GPU running a 4-bit 8B model (~4.5 GB weights) with --gpu-memory-utilization 0.90, you have roughly 17 GB left for KV cache (21.6 GB usable minus ~4.5 GB of weights).

With roughly 17 GB available for KV cache in FP8 (64 KB per token), each 4,096-token request needs 256 MB, so the budget supports up to around 68 concurrent requests at full context — treat that as an upper bound, since activation memory and fragmentation will reduce it in practice. Set --max-num-seqs to match this calculation.
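The budget arithmetic above, as a sketch (all figures approximate; this ignores activation memory, so the result is an upper bound):

```python
total_gb = 24
usable_gb = total_gb * 0.90            # --gpu-memory-utilization 0.90
weights_gb = 4.5                       # 4-bit quantised 8B model
kv_budget_gb = usable_gb - weights_gb  # ~17.1 GB left for KV cache

per_token_kb = 64                      # FP8 KV cache, Llama 3 8B
per_request_gb = 4096 * per_token_kb / 2**20  # full-context request

max_concurrency = int(kv_budget_gb / per_request_gb)
print(f"{kv_budget_gb:.1f} GB budget, {per_request_gb:.2f} GB/request, "
      f"up to ~{max_concurrency} concurrent requests")
```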

Monitor KV cache usage by watching VRAM consumption during peak load via nvidia-smi. If VRAM hits 95%+, reduce max batch size or context length. For monitoring setup, see our GPU monitoring guide. For batch size effects on throughput, review the batch size impact analysis.

Understanding KV cache is fundamental to running efficient LLM inference on dedicated GPU hosting. Get the memory planning right and you unlock significantly higher throughput from the same hardware.

GPU Servers Sized for Your KV Cache Needs

From 8 GB to multi-GPU configurations, GigaGPU has the VRAM you need. UK-hosted dedicated servers for production LLM inference.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.

