vLLM KV Cache Block Size Tuning

PagedAttention's block size controls memory fragmentation and throughput - the default is usually fine, but not always.

vLLM’s PagedAttention allocates KV cache in fixed-size blocks, similar to virtual memory pages. Block size controls how efficiently concurrent sequences share KV memory on dedicated GPU servers. The default (16 tokens per block) is right for most workloads. It is not right for all.

What It Controls

KV cache is allocated in blocks of N tokens each. A block is the minimum unit of allocation, so a sequence of length L occupies ceil(L / block_size) blocks. Smaller blocks mean less fragmentation waste but more metadata overhead; larger blocks mean less metadata but more wasted memory in each sequence's partially filled final block.
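The arithmetic above can be sketched directly (a minimal illustration; the helper names are ours, not vLLM APIs):

```python
import math

def blocks_needed(seq_len: int, block_size: int) -> int:
    """Number of KV cache blocks a sequence of seq_len tokens occupies."""
    return math.ceil(seq_len / block_size)

def wasted_tokens(seq_len: int, block_size: int) -> int:
    """Internal fragmentation: allocated token slots that go unused."""
    return blocks_needed(seq_len, block_size) * block_size - seq_len

# A 100-token sequence with the default block size of 16:
print(blocks_needed(100, 16))  # 7 blocks (112 slots allocated)
print(wasted_tokens(100, 16))  # 12 slots wasted in the last block
```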

Tradeoffs

Block Size   | Fragmentation            | Metadata Overhead | Prefix Cache Granularity
8            | Low                      | Higher            | Fine
16 (default) | Moderate                 | Normal            | Moderate
32           | Higher (esp. short seqs) | Lower             | Coarse
64           | High for short seqs      | Lowest            | Very coarse
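A quick way to see the fragmentation side of this tradeoff is to total the wasted slots over a sample of sequence lengths (illustrative numbers, not a benchmark):

```python
import math

def total_waste(lengths, block_size):
    """Sum of unused token slots across sequences (internal fragmentation)."""
    return sum(math.ceil(n / block_size) * block_size - n for n in lengths)

short = [5, 12, 20, 9, 30]   # classification-style outputs
long_ = [700, 1200, 950]     # document-generation outputs

for bs in (8, 16, 32, 64):
    print(bs, total_waste(short, bs), total_waste(long_, bs))
```

Waste on the short workload grows quickly as block size increases, while the long workload barely notices - which is exactly the pattern in the table.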

When to Tune

You want a smaller block (8) when:

  • Your workload has many very short sequences (1-30 tokens output)
  • You are using prefix caching and want fine-grained cache hits

You want a larger block (32 or 64) when:

  • Most sequences are long (500+ tokens)
  • You care about maximum decode speed and can tolerate some fragmentation
  • VRAM is plentiful and internal fragmentation does not cost you real capacity

For typical chat workloads with mixed output lengths, leave it at 16.
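The prefix-caching point is also a granularity question: a shared prefix can only be reused in whole blocks, so the reusable portion is effectively rounded down to a block boundary. A rough sketch (the helper is ours, not a vLLM API):

```python
def reusable_prefix_tokens(prefix_len: int, block_size: int) -> int:
    """Tokens of a shared prefix that land in complete, reusable blocks."""
    return (prefix_len // block_size) * block_size

# A 90-token shared system prompt:
for bs in (8, 16, 32, 64):
    print(bs, reusable_prefix_tokens(90, bs))
```

Smaller blocks waste fewer prefix tokens to the final partial block, which is why block size 8 pairs well with fine-grained prefix caching.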

Practical Values

# Default, works for most
--block-size 16

# Short-output heavy (classification, micro-responses)
--block-size 8

# Long-output (document generation, summarisation)
--block-size 32
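If you log output lengths from your workload, the rules of thumb above can be folded into a starting point like this (`suggest_block_size` is a hypothetical helper, not part of vLLM; the thresholds mirror the guidance in this post):

```python
import statistics

def suggest_block_size(output_lengths):
    """Pick a starting --block-size from observed output lengths.

    Mostly very short outputs favour 8, mostly long outputs favour 32,
    and anything in between sticks with the default of 16.
    """
    median = statistics.median(output_lengths)
    if median <= 30:
        return 8
    if median >= 500:
        return 32
    return 16

print(suggest_block_size([4, 8, 15, 22]))       # classification-style
print(suggest_block_size([600, 800, 1200]))     # document generation
print(suggest_block_size([80, 200, 350, 500]))  # mixed chat
```

Treat the result as a first guess to benchmark against, not a final answer - measured throughput on your own traffic is what decides.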

See continuous batching tuning and prefix caching.
