Quick Verdict: KV Cache Compression vs Model Quantisation
Model quantisation reduces the static memory footprint of model weights. KV cache compression reduces the dynamic memory that grows with sequence length and concurrency. A 70B model at INT4 occupies roughly 38 GB of fixed VRAM, while its FP16 KV cache adds about 2 GB per concurrent request at 4K context. With 20 concurrent users, the KV cache alone consumes 40 GB, exceeding the quantised weights themselves. On dedicated GPU hosting, compressing both the model and the KV cache unlocks the highest concurrent user counts per GPU.
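The headline arithmetic is simple enough to sketch. The 38 GB and 2 GB inputs are the figures quoted above, not measured values:

```python
def total_vram_gb(model_gb: float, kv_per_user_gb: float, users: int) -> float:
    """Static model weights plus dynamic KV cache across all concurrent requests."""
    return model_gb + kv_per_user_gb * users

# 70B at INT4 (38 GB weights) with 20 users at 4K context, FP16 cache (2 GB each)
kv_total = total_vram_gb(38, 2.0, 20) - 38
print(f"KV cache alone: {kv_total:.0f} GB")  # 40 GB -- more than the weights
```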
Understanding the Memory Split
GPU VRAM during LLM inference is split between model weights (static) and KV cache (dynamic). Model weights load once and remain constant. The KV cache stores key-value attention states for every active sequence and grows linearly with both context length and concurrent users.
For short conversations (under 2K tokens), model weights dominate VRAM usage. For long contexts (8K-32K tokens) with multiple concurrent users, the KV cache overtakes model weights as the primary memory consumer. Understanding this split is essential for configuring vLLM deployments. See our production setup guide for configuration details.
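Per-sequence cache size follows directly from the attention geometry: two tensors (keys and values) per layer, per KV head, per token. A minimal sketch, assuming a Llama-2-70B-like layout (80 layers, 8 grouped-query KV heads, head dimension 128 are assumptions here; exact values vary by architecture):

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Raw KV cache for one sequence: 2 (K and V) x layers x heads x dim x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

print(f"{kv_cache_bytes(4096) / 2**30:.2f} GB raw at 4K context in FP16")  # 1.25 GB
```

The 2 GB per-request figure quoted earlier is a serving budget rather than this raw size: runtimes such as vLLM pre-allocate cache blocks and keep headroom, so the effective per-user footprint is higher than the formula alone suggests.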
Memory Breakdown (70B Model, RTX 6000 Pro 96 GB)
| Configuration | Model VRAM | KV Cache per User (4K ctx) | Max Concurrent Users |
|---|---|---|---|
| FP16 model + FP16 cache | 140 GB (2 GPUs) | 2.0 GB | ~26 (on 2× 96 GB) |
| INT4 model + FP16 cache | 38 GB | 2.0 GB | ~20 |
| INT4 model + FP8 cache | 38 GB | 1.0 GB | ~40 |
| INT4 model + INT4 cache | 38 GB | 0.5 GB | ~80 |
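The max-user column for the single-GPU rows can be reproduced by dividing free VRAM by per-user cache size. A sketch where the 18 GB reserve is an assumption chosen to match the table's round figures; real headroom depends on activation memory and vLLM's `gpu_memory_utilization` setting:

```python
def max_users(vram_gb: float, model_gb: float, kv_per_user_gb: float,
              reserve_gb: float = 18) -> int:
    """Concurrent 4K-context sequences that fit after weights and a working reserve."""
    return int((vram_gb - model_gb - reserve_gb) / kv_per_user_gb)

# Single 96 GB GPU, INT4 weights (38 GB), varying KV cache precision
for label, kv_gb in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"{label} cache: ~{max_users(96, 38, kv_gb)} users")
```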
Quality Impact
Model quantisation from FP16 to INT4 produces a permanent 1-3% quality loss across all requests. Every token generated passes through quantised weights. KV cache quantisation from FP16 to FP8 produces less than 0.5% quality loss because attention patterns are inherently more tolerant of precision reduction. INT4 KV cache shows 1-2% quality loss at long contexts. Check benchmarks for quality comparisons at different precision levels.
The practical implication: compressing the KV cache is nearly free in quality terms while dramatically increasing concurrent capacity. Model quantisation is a stronger intervention with more noticeable quality trade-offs. Always compress the KV cache first. Review the benchmarks section for detailed perplexity measurements.
When to Compress What
Quantise the model when: The model does not fit in VRAM at full precision, or you want to run a larger model on fewer GPUs (see the GPU selection guide). This addresses the static memory problem.
Compress the KV cache when: You need more concurrent users, longer context windows, or are already running a quantised model and need more headroom. vLLM supports FP8 KV cache natively, including on multi-GPU clusters.
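In vLLM, enabling the FP8 cache is a single argument. A configuration sketch for the "compress both" setup; the checkpoint name is illustrative, and any AWQ-quantised model works:

```python
from vllm import LLM

# INT4 (AWQ) weights + FP8 KV cache: the "compress both" configuration
llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",
    kv_cache_dtype="fp8",         # halves per-token cache memory vs FP16
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
)
```

The equivalent server flags are `--quantization awq` and `--kv-cache-dtype fp8` on `vllm serve`.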
Compress both when: You are maximising throughput per GPU dollar on private AI hosting. INT4 model + FP8 KV cache is the current production sweet spot for high-concurrency LLM hosting.
Recommendation
Start with INT4 model quantisation (AWQ) to fit your target model. Then enable FP8 KV cache in vLLM for maximum concurrency with minimal quality loss. Only go to INT4 KV cache if you need 50+ concurrent users per GPU. Deploy this configuration on GigaGPU dedicated servers and follow the vLLM production guide. Explore the infrastructure blog for memory optimisation strategies.