
KV Cache vs Model Quantization: What to Compress

Comparing KV cache compression and model weight quantisation for reducing LLM memory usage. When to compress the cache, when to quantise the model, and how combining both maximises throughput.

Quick Verdict: KV Cache vs Model Quantization

Model quantisation reduces the static memory footprint of model weights. KV cache compression reduces the dynamic memory that grows with sequence length and the number of concurrent requests. A 70B model at INT4 uses roughly 38GB of fixed VRAM; its FP16 KV cache adds about 2GB per concurrent request at 4K context, so 20 concurrent users consume 40GB of cache alone. On dedicated GPU hosting, compressing both the model and the KV cache unlocks the highest concurrent user counts per GPU.

Understanding the Memory Split

GPU VRAM during LLM inference is split between model weights (static) and KV cache (dynamic). Model weights load once and remain constant. The KV cache stores key-value attention states for every active sequence and grows linearly with both context length and concurrent users.

For short conversations (under 2K tokens), model weights dominate VRAM usage. For long contexts (8K-32K tokens) with multiple concurrent users, the KV cache overtakes model weights as the primary memory consumer. Understanding this split is essential for configuring vLLM deployments. See our production setup guide for configuration details.
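The dynamic side of this split follows a simple formula: per-token KV cache size is 2 (one K and one V tensor) × layers × KV heads × head dimension × bytes per element. A minimal sketch, assuming illustrative 70B-class parameters with grouped-query attention (80 layers, 8 KV heads, head dim 128) — these are assumptions for illustration, not measurements from our deployments:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, dtype_bytes, num_tokens):
    """Per-sequence KV cache size: a K and a V tensor for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens

# Illustrative 70B-class model with grouped-query attention (assumed values)
fp16 = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                      dtype_bytes=2, num_tokens=4096)
fp8 = kv_cache_bytes(80, 8, 128, 1, 4096)  # FP8 cache: 1 byte per element

print(f"FP16 KV cache at 4K ctx: {fp16 / 2**30:.2f} GiB")
print(f"FP8  KV cache at 4K ctx: {fp8 / 2**30:.2f} GiB")
```

Exact figures depend on the attention layout: models without grouped-query attention carry many more KV heads and need several times more cache per token, which is why published per-user numbers vary between sources.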

Memory Breakdown (70B Model, RTX 6000 Pro 96 GB)

| Configuration | Model VRAM | KV Cache per User (4K ctx) | Max Concurrent Users |
| --- | --- | --- | --- |
| FP16 model + FP16 cache | 140 GB (2 GPUs) | 2.0 GB | ~10 (on 2x 80 GB) |
| INT4 model + FP16 cache | 38 GB | 2.0 GB | ~20 |
| INT4 model + FP8 cache | 38 GB | 1.0 GB | ~40 |
| INT4 model + INT4 cache | 38 GB | 0.5 GB | ~80 |
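The concurrency column is essentially usable VRAM after weights, divided by per-user cache. A rough sketch, assuming a flat utilisation factor to reserve headroom (an assumption — real serving frameworks hold back memory for activations, CUDA context, and allocator fragmentation, so treat the result as an upper bound):

```python
import math

def max_concurrent_users(total_vram_gb, model_vram_gb, kv_per_user_gb,
                         utilization=0.9):
    """Rough capacity estimate: VRAM left after model weights, divided by
    per-user KV cache. `utilization` is an assumed flat headroom factor."""
    usable = total_vram_gb * utilization - model_vram_gb
    return max(0, math.floor(usable / kv_per_user_gb))

# INT4 weights (38 GB) on a 96 GB card, per the table above
for kv_gb, label in [(2.0, "FP16 cache"), (1.0, "FP8 cache"), (0.5, "INT4 cache")]:
    print(f"{label}: ~{max_concurrent_users(96, 38, kv_gb)} users")
```

This simple division lands somewhat above the table's figures, which additionally account for scheduler and activation headroom observed in practice.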

Quality Impact

Model quantisation from FP16 to INT4 produces a permanent 1-3% quality loss across all requests. Every token generated passes through quantised weights. KV cache quantisation from FP16 to FP8 produces less than 0.5% quality loss because attention patterns are inherently more tolerant of precision reduction. INT4 KV cache shows 1-2% quality loss at long contexts. Check benchmarks for quality comparisons at different precision levels.

The practical implication: compressing the KV cache is nearly free in quality terms while dramatically increasing concurrent capacity. Model quantisation is a stronger intervention with more noticeable quality trade-offs. Always compress the KV cache first. Review the benchmarks section for detailed perplexity measurements.

When to Compress What

Quantise the model when: The model does not fit in VRAM at full precision, or you want to run a larger model on fewer GPUs from the GPU selection guide. This addresses the static memory problem.

Compress the KV cache when: You need more concurrent users, longer context windows, or are already running a quantised model and need more headroom. vLLM supports FP8 KV cache natively on multi-GPU clusters.

Compress both when: You are maximising throughput per GPU dollar on private AI hosting. INT4 model + FP8 KV cache is the current production sweet spot for high-concurrency LLM hosting.

Recommendation

Start with INT4 model quantisation (AWQ) to fit your target model. Then enable FP8 KV cache in vLLM for maximum concurrency with minimal quality loss. Only go to INT4 KV cache if you need 50+ concurrent users per GPU. Deploy this configuration on GigaGPU dedicated servers and follow the vLLM production guide. Explore the infrastructure blog for memory optimisation strategies.
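The recommended combination can be expressed as a single vLLM launch. A sketch assuming vLLM's OpenAI-compatible `vllm serve` entrypoint and an AWQ-quantised checkpoint — the model name is a placeholder, not a tested configuration from this guide:

```shell
# INT4 (AWQ) weights + FP8 KV cache: the sweet spot described above.
# Substitute your own AWQ checkpoint for the placeholder model name.
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --kv-cache-dtype fp8 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90
```

`--gpu-memory-utilization` controls how much VRAM vLLM pre-allocates for weights plus KV cache; raising it increases concurrent capacity at the cost of headroom for other processes on the GPU.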

