PagedAttention is the algorithmic insight that made vLLM's throughput possible. Before vLLM, KV cache fragmentation wasted 60-80% of allocated memory. After, fragmentation is essentially zero. Understanding the algorithm helps you tune vLLM correctly.
Traditional KV cache: contiguous per-request allocation. Wastes memory through internal + external fragmentation. PagedAttention: KV cache split into fixed-size blocks (default 16 tokens), allocated on demand via a per-request page table. Eliminates external fragmentation and caps internal fragmentation at one partially filled block per sequence. Enables 2-4× effective KV cache utilisation, which translates directly into 2-4× concurrent serving capacity.
The problem
Pre-PagedAttention, LLM serving frameworks allocated KV cache contiguously per request: to handle a max-context request, you reserved a full max-context's worth of contiguous memory up front. Reality: most requests are far shorter than max context. Net: 60-80% of allocated KV memory sat unused.
Worse, when requests finish, freed memory is fragmented. New large requests can't fit even when total free memory is sufficient.
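To make the waste concrete, here is a back-of-the-envelope sketch. The model shape and request length are assumptions (roughly a 7B-class transformer in fp16), not measured numbers:

```python
# Rough KV cache arithmetic for a hypothetical 7B-class model (illustrative assumptions).
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 32, 128, 2   # fp16
MAX_CONTEXT = 4096        # tokens reserved up front under contiguous allocation
TYPICAL_LENGTH = 512      # tokens a typical request actually uses (assumption)

# K and V for every layer and head, per token
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES   # ~0.5 MiB

reserved = MAX_CONTEXT * bytes_per_token      # ~2.0 GiB reserved per request
used = TYPICAL_LENGTH * bytes_per_token       # ~0.25 GiB actually used
print(f"wasted: {100 * (1 - used / reserved):.0f}% of the reservation")   # ~88%
```

Under these assumptions a single request pins ~2 GiB of VRAM while touching only a quarter of a gigabyte, which is exactly the 60-80%+ waste regime described above.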
PagedAttention
PagedAttention applies operating-system virtual-memory paging to the KV cache:
- KV cache split into fixed-size blocks (default: 16 tokens' worth of K + V)
- Each request gets a page table mapping logical sequence positions to physical blocks
- Blocks allocated on-demand as sequences grow
- Blocks freed back to global pool when sequences finish
- Attention kernels rewritten to walk page tables
Result: near-zero fragmentation (at most one partially filled block per sequence), on-demand allocation, and easy KV reuse via shared page tables (the foundation of prefix caching).
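A minimal sketch of the bookkeeping, not vLLM's actual code (class and method names here are invented for illustration): a global pool of fixed-size physical blocks, a per-sequence page table, blocks grabbed on demand and returned on completion.

```python
class BlockManager:
    """Toy block manager illustrating PagedAttention-style bookkeeping (not vLLM's code)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # global pool of physical block ids
        self.page_tables: dict[int, list[int]] = {}   # seq_id -> physical blocks, in logical order
        self.num_tokens: dict[int, int] = {}          # seq_id -> tokens written so far

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a KV slot for one new token, allocating a fresh block only when needed."""
        table = self.page_tables.setdefault(seq_id, [])
        n = self.num_tokens.get(seq_id, 0)
        if n % self.block_size == 0:                  # previous block is full (or first token)
            table.append(self.free_blocks.pop())      # allocate on demand from the global pool
        self.num_tokens[seq_id] = n + 1
        # The attention kernel locates this token's K/V at (physical block, slot within block).
        return table[-1], n % self.block_size

    def free_sequence(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for any other request to reuse."""
        self.free_blocks.extend(self.page_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)
```

Because every allocation is exactly one block, any freed block can serve any new request: there is no contiguity requirement, so external fragmentation cannot build up.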
Impact
- 2-4× effective KV cache utilisation — directly more concurrent serving capacity
- Prefix caching — multiple requests with a shared prefix point at the same physical KV blocks (sketched after this list)
- Beam search efficiency — multiple beams share KV blocks for the common prefix
- Sequence forking / parallel sampling — cheap via shared blocks, with copy-on-write when a fork diverges
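A companion sketch of the sharing side, again illustrative rather than vLLM's implementation: physical blocks carry a reference count, a forked or prefix-matching sequence simply takes extra references to the parent's full blocks, and a block returns to the pool only when its last user releases it. (vLLM additionally copies a shared block on write if one branch needs to diverge inside it.)

```python
class SharedBlockPool:
    """Reference-counted physical blocks so sequences can share prefix KV (illustrative only)."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.ref_count: dict[int, int] = {}

    def allocate(self) -> int:
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def share(self, blocks: list[int]) -> list[int]:
        # A fork, beam, or prefix-cache hit reuses the same physical KV: no recompute, no copy.
        for b in blocks:
            self.ref_count[b] += 1
        return list(blocks)

    def release(self, blocks: list[int]) -> None:
        # A block goes back to the pool only when its last user is done with it.
        for b in blocks:
            self.ref_count[b] -= 1
            if self.ref_count[b] == 0:
                self.free_blocks.append(b)


pool = SharedBlockPool(num_blocks=1024)
prompt_blocks = [pool.allocate() for _ in range(4)]  # e.g. a 64-token shared system prompt
beam_a = pool.share(prompt_blocks)                   # both beams point at the same physical KV
beam_b = pool.share(prompt_blocks)
```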
Verdict
PagedAttention is the algorithm that made high-throughput LLM serving practical on consumer GPUs. Understanding it helps you tune --block-size and reason about VRAM budget. For most production deployments, vLLM's defaults are right; understanding the algorithm helps when defaults aren't.
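For the VRAM-budget side, the same hypothetical 7B-in-fp16 figures from above give a feel for what one block costs and how many fit. The 8 GiB budget is an assumption, standing in for whatever `gpu_memory_utilization` leaves after weights and activations:

```python
# How many KV blocks fit in a given budget (same hypothetical 7B-in-fp16 shape as above).
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 32, 128, 2
BLOCK_SIZE = 16                                        # vLLM's default --block-size

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES
bytes_per_block = BLOCK_SIZE * bytes_per_token         # 8 MiB per block here

kv_budget = 8 * 2**30                                  # assumed KV budget on a 24 GiB card
num_blocks = kv_budget // bytes_per_block
print(f"{num_blocks} blocks = {num_blocks * BLOCK_SIZE} cached tokens across all in-flight requests")
# ~1024 blocks = 16384 tokens; contiguous allocation would fit only four 4096-token reservations
```

Because any free block can serve any request, the whole ~16k tokens of KV capacity is usable, which is where the extra concurrency comes from.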
Bottom line
PagedAttention = vLLM's throughput enabler. See block size tuning.