vLLM’s PagedAttention allocates KV cache in fixed-size blocks, similar to virtual memory pages. Block size controls how efficiently concurrent sequences share KV memory on dedicated GPU servers. The default (16 tokens per block) is right for most workloads. It is not right for all.
What It Controls
KV cache is allocated in fixed blocks of N tokens each; a block is the minimum unit of allocation. A sequence of L tokens occupies ceil(L / N) blocks, so its final block is, on average, half empty. Smaller blocks mean less internal fragmentation but more block-table metadata and lookup overhead; larger blocks mean less metadata but more wasted memory whenever a sequence does not fill its last block.
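The arithmetic above can be sketched directly. This is a minimal illustration of the ceil(L / N) allocation rule and the internal fragmentation it implies; the function names are ours, not vLLM's:

```python
import math

def blocks_needed(seq_len: int, block_size: int) -> int:
    # A sequence of seq_len tokens occupies ceil(seq_len / block_size) blocks.
    return math.ceil(seq_len / block_size)

def wasted_tokens(seq_len: int, block_size: int) -> int:
    # Internal fragmentation: slots allocated in the final block but never filled.
    return blocks_needed(seq_len, block_size) * block_size - seq_len

# A 100-token sequence with the default 16-token blocks:
print(blocks_needed(100, 16))   # 7 blocks
print(wasted_tokens(100, 16))   # 12 wasted slots
```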
Tradeoffs
| Block Size | Fragmentation | Metadata Overhead | Prefix Cache Granularity |
|---|---|---|---|
| 8 | Low | Higher | Fine |
| 16 (default) | Moderate | Normal | Moderate |
| 32 | Higher (esp. short seqs) | Lower | Coarse |
| 64 | High for short seqs | Lowest | Very coarse |
When to Tune
You want a smaller block (8) when:
- Your workload has many very short sequences (1-30 tokens output)
- You are using prefix caching and want fine-grained cache hits
You want a larger block (32 or 64) when:
- Most sequences are long (500+ tokens)
- You care about maximum decode speed and can tolerate some fragmentation
- VRAM is plentiful and internal fragmentation does not cost you real capacity
For typical chat workloads with mixed output lengths, leave it at 16.
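One way to turn these rules of thumb into a measurement: sum the per-sequence fragmentation over a sample of your real sequence lengths at each candidate block size. A minimal sketch, with an invented sample distribution (the numbers below are illustrative, not from the source):

```python
import math

def total_waste(lengths, block_size):
    # Total unused KV slots across a workload: the empty tail of each
    # sequence's final block, summed over all sequences.
    return sum(math.ceil(n / block_size) * block_size - n for n in lengths)

# Hypothetical mixed-chat sample: mostly short replies plus a few long ones.
sample = [12, 25, 40, 300, 700, 18, 55, 900]
for b in (8, 16, 32, 64):
    print(b, total_waste(sample, b))
```

Run this against lengths sampled from your own traffic; if the waste at 32 or 64 is negligible relative to your KV budget, the larger block is safe to take.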
Practical Values
```
# Default, works for most workloads
--block-size 16

# Short-output heavy (classification, micro-responses)
--block-size 8

# Long-output (document generation, summarisation)
--block-size 32
```
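To judge what a given block size costs in VRAM, you can compute the bytes one block holds: each token stores a key and a value vector per layer per KV head. A sketch, using illustrative Llama-3-8B-like dimensions (32 layers, 8 KV heads under GQA, head dim 128, fp16); the function name and example figures are ours:

```python
def kv_bytes_per_block(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int,
                       block_size: int) -> int:
    # 2x for the key and value tensors, per layer, per KV head, per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * block_size

# Illustrative 8B-class model, fp16 (2 bytes), default 16-token blocks:
print(kv_bytes_per_block(32, 8, 128, 2, 16) / 1024)  # → 2048.0 KiB per block
```

At these dimensions a block is 2 MiB regardless of how full it is, which is why fragmentation from oversized blocks translates directly into lost concurrent-sequence capacity.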
See continuous batching tuning and prefix caching.