vLLM PagedAttention Explained

PagedAttention is the algorithm that makes vLLM's KV cache management efficient. The intuition, the implementation, the impact.

PagedAttention is the algorithmic insight that made vLLM's throughput possible. Before vLLM, KV cache fragmentation wasted 60-80% of allocated memory. After, fragmentation is essentially zero. Understanding the algorithm helps you tune vLLM correctly.

TL;DR

Traditional KV cache: contiguous per-request allocation. Wastes memory due to internal + external fragmentation. PagedAttention: KV cache split into fixed-size blocks (default 16 tokens), allocated on-demand via a page table. Eliminates fragmentation. Enables 2-4× effective KV cache utilisation, which directly translates to 2-4× concurrent serving capacity.

The problem

Pre-PagedAttention, LLM serving frameworks allocated KV cache contiguously per request. Output length isn't known up front, so each request reserved a max-context-sized contiguous slab at admission. Reality: most requests finish far short of max context. Net: 60-80% of allocated KV memory sits unused.

Worse, when requests finish, the freed memory is fragmented. A new large request can't find a contiguous region even when the total free memory would be enough.
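
To put rough numbers on the reservation waste, here is a back-of-the-envelope sketch in Python. All figures are illustrative (a 7B-class model with 32 layers, 32 KV heads, head dim 128, fp16), not measurements:

    # Contiguous per-request reservation vs. what the requests actually use.
    # Illustrative figures: ~7B model, 32 layers, 32 KV heads, head_dim 128, fp16.
    bytes_per_token = 2 * 32 * 32 * 128 * 2        # K and V -> ~512 KiB per token
    max_context = 4096                             # every request reserves this up front
    actual_lengths = [400, 1800, 250, 2600, 900]   # hypothetical real request lengths

    reserved = len(actual_lengths) * max_context * bytes_per_token
    used = sum(actual_lengths) * bytes_per_token
    print(f"reserved {reserved / 2**30:.1f} GiB, used {used / 2**30:.1f} GiB "
          f"({100 * (1 - used / reserved):.0f}% wasted)")

With these hypothetical lengths the reservation wastes roughly 70% of the KV memory, squarely in the 60-80% range above.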

PagedAttention

PagedAttention applies operating-system virtual memory paging to KV cache:

  • KV cache split into fixed-size blocks (default 16 tokens' worth of K + V)
  • Each request gets a page table mapping logical sequence positions to physical blocks
  • Blocks allocated on-demand as sequences grow
  • Blocks freed back to global pool when sequences finish
  • Attention kernels rewritten to walk page tables

Result: near-zero fragmentation (waste is bounded by at most one partially filled block per sequence), on-demand allocation, easy KV reuse via shared page tables (the foundation of prefix caching).
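
A minimal sketch of the bookkeeping (simplified, not vLLM's actual classes): a global pool of fixed-size physical blocks plus a per-sequence page table mapping logical block index to physical block id.

    BLOCK_SIZE = 16  # tokens per block (vLLM's default)

    class BlockPool:
        """Global pool of physical KV blocks; any free block fits any sequence."""
        def __init__(self, num_blocks: int):
            self.free = list(range(num_blocks))

        def allocate(self) -> int:
            return self.free.pop()            # no contiguity requirement

        def release(self, block_id: int) -> None:
            self.free.append(block_id)        # immediately reusable by other requests

    class Sequence:
        """Per-request page table: logical block index -> physical block id."""
        def __init__(self, pool: BlockPool):
            self.pool = pool
            self.page_table: list[int] = []
            self.num_tokens = 0

        def append_token(self) -> None:
            if self.num_tokens % BLOCK_SIZE == 0:      # current block full -> grab a new one
                self.page_table.append(self.pool.allocate())
            self.num_tokens += 1

        def free(self) -> None:
            for block_id in self.page_table:           # return everything to the pool
                self.pool.release(block_id)
            self.page_table.clear()

The attention kernel then gathers K/V for position t from block page_table[t // BLOCK_SIZE], slot t % BLOCK_SIZE, instead of indexing one contiguous buffer.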

Impact

  • 2-4× effective KV cache utilisation — directly more concurrent serving capacity
  • Prefix caching — multiple requests with a shared prefix point at the same physical KV blocks (see the sketch after this list)
  • Beam search efficiency — multiple beams share KV blocks for the common prefix
  • Sequence forking / parallel sampling — cheap via shared blocks
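
Sharing works through reference counts on physical blocks. A hypothetical, stripped-down illustration (again not vLLM's internals): two requests with an identical prompt map the same prompt blocks, and a block only returns to the pool when its last user releases it.

    free_blocks = list(range(1024))
    refcount: dict[int, int] = {}              # physical block id -> number of sequences mapping it

    def allocate_block() -> int:
        block_id = free_blocks.pop()
        refcount[block_id] = 1
        return block_id

    def share_block(block_id: int) -> int:
        refcount[block_id] += 1                # another sequence maps the same physical block
        return block_id

    def release_block(block_id: int) -> None:
        refcount[block_id] -= 1
        if refcount[block_id] == 0:            # freed only when the last sequence lets go
            free_blocks.append(block_id)

    # Two requests sharing an identical 32-token prompt (two full 16-token blocks):
    prompt_blocks = [allocate_block(), allocate_block()]
    request_a = list(prompt_blocks)                        # page table for request A
    request_b = [share_block(b) for b in prompt_blocks]    # B maps the same physical blocks
    request_b.append(allocate_block())                     # B's generated tokens get a fresh block

Real vLLM also does copy-on-write: when a shared, partially filled block is about to be written (e.g. two beams diverging), it is copied to a fresh block first. That step is omitted here for brevity.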

Verdict

PagedAttention is the algorithm that made high-throughput LLM serving practical on consumer GPUs. Understanding it helps you tune --block-size and reason about VRAM budget. For most production deployments, vLLM's defaults are right; understanding the algorithm helps when defaults aren't.
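
To reason about that budget, the per-token and per-block arithmetic is straightforward. A sketch with example figures (Llama-3-8B-style GQA: 32 layers, 8 KV heads, head dim 128, fp16); substitute your own model's numbers:

    num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
    block_size = 16                                # vLLM's --block-size default

    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes   # K and V
    bytes_per_block = bytes_per_token * block_size

    kv_budget_gib = 8                              # VRAM left for KV cache after weights etc.
    num_blocks = kv_budget_gib * 2**30 // bytes_per_block
    print(f"{bytes_per_token // 1024} KiB/token, {bytes_per_block // 1024} KiB/block, "
          f"{num_blocks} blocks = {num_blocks * block_size} cacheable tokens")

A larger --block-size means fewer, bigger blocks (less page-table overhead, more internal waste per sequence); a smaller one means the opposite. The default of 16 is a sensible middle ground for most models.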

Bottom line

PagedAttention = vLLM's throughput enabler. See block size tuning.
