For high-throughput LLM serving at scale, the emerging pattern is to disaggregate prefill (compute-bound, large parallel batches) from decode (memory-bound, small sequential steps). Different GPUs specialised for each phase. Throughput improvements substantial; ops complexity increases meaningfully.
Prefill on compute-heavy GPUs (4090 / H100); decode on bandwidth-heavy GPUs (5090 / H200). KV cache transferred between phases. Throughput improves ~30-50% on production workloads. Available in vLLM 0.7+ via experimental flag, mature in TensorRT-LLM and SGLang. Worth it for: high-volume production at multi-GPU scale.
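The KV cache hop between the two pools is cheaper than it sounds. A back-of-envelope sketch below, using assumed Llama-3.1-70B-style dimensions (80 layers, 8 KV heads via GQA, head dim 128, bf16 cache) and an assumed ~450 GB/s NVLink-class link; plug in your own model's config.

```python
# Back-of-envelope: how big is the KV cache that moves from the prefill GPU
# to the decode GPU, and how long does the hop take over a fast interconnect?
# All dimensions below are assumptions (Llama-3.1-70B-style, GQA with 8 KV heads).

num_layers     = 80        # transformer blocks (assumed)
num_kv_heads   = 8         # grouped-query attention KV heads (assumed)
head_dim       = 128       # per-head dimension (assumed)
bytes_per_elem = 2         # bf16 KV cache
prompt_tokens  = 2048      # prefill length for this request

# K and V, per layer, per token
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
kv_bytes_total     = kv_bytes_per_token * prompt_tokens

nvlink_bw_bytes = 450e9    # ~NVLink-class bandwidth in bytes/s (assumed)
transfer_s      = kv_bytes_total / nvlink_bw_bytes

print(f"KV cache per token : {kv_bytes_per_token / 2**10:.0f} KiB")
print(f"KV cache for prompt: {kv_bytes_total / 2**20:.0f} MiB")
print(f"Transfer time      : {transfer_s * 1e3:.2f} ms")
# -> roughly 320 KiB/token, ~640 MiB for this prompt, ~1.5 ms over the link.
```

Under these assumptions the hop is a millisecond-class cost per request, small next to a prefill pass that takes tens to hundreds of milliseconds.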
Why disaggregate
Prefill and decode have very different compute / memory characteristics:
- Prefill: forward pass on the entire input sequence. Compute-bound. Benefits from large parallel batches.
- Decode: forward pass on one new token at a time. Memory-bound (the full model weights must be read for every step). Doesn't benefit from large batches in the same way (see the rough roofline sketch after this list).
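A rough roofline sketch of why the two phases land on opposite sides of the compute/memory boundary. The hardware peaks are assumptions (H100-SXM-class), and the usual ~2 FLOPs-per-parameter-per-token approximation is used; the exact figures matter less than which side of the ridge point each phase falls on.

```python
# Rough roofline numbers for prefill vs decode on a dense transformer.
# Hardware figures are assumptions (~1e15 bf16 FLOP/s, ~3.35e12 B/s HBM).

params          = 70e9     # model parameters (assumed 70B)
bytes_per_param = 2        # bf16 weights
peak_flops      = 1.0e15   # assumed peak dense bf16 FLOP/s
peak_bw         = 3.35e12  # assumed HBM bandwidth, bytes/s
ridge           = peak_flops / peak_bw   # FLOPs/byte where compute == memory

def intensity(tokens_in_flight: int) -> float:
    """FLOPs per byte of weights read: ~2*params FLOPs per token,
    weights streamed once per forward pass."""
    flops      = 2 * params * tokens_in_flight
    bytes_read = params * bytes_per_param
    return flops / bytes_read

prefill = intensity(2048)   # one 2048-token prompt processed in parallel
decode  = intensity(1)      # one new token per step (batch size 1)

print(f"ridge point : {ridge:6.0f} FLOPs/byte")
print(f"prefill     : {prefill:6.0f} FLOPs/byte -> above the ridge, compute-bound")
print(f"decode      : {decode:6.0f} FLOPs/byte -> far below the ridge, memory-bound")
```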
Running both on the same GPU, prefill bursts compete with decode for resources and latency variability rises. Splitting them, each GPU is specialised: the prefill GPU runs hot on large batches; the decode GPU runs a continuous low-latency loop.
How it works
- Request lands on router
- Router sends to prefill GPU pool; prefill computes initial KV cache
- KV cache transferred (over NVLink / fast interconnect) to decode GPU pool
- Decode GPU streams output tokens, using the KV cache
- Decode GPU returns response to router; router returns to client
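A minimal router sketch of this flow. It assumes two OpenAI-compatible engine instances already configured for KV transfer; the ports (8100/8200), the model name, and the "prefill call with max_tokens=1" trigger are assumptions modelled on common disaggregated-prefill proxy examples, not a specific engine's API. The engines move the KV cache over the interconnect themselves; the router only sequences the two calls.

```python
# Sketch of a prefill/decode router, not a production implementation.

import asyncio
import httpx

PREFILL_URL = "http://localhost:8100/v1/completions"   # assumed prefill pool
DECODE_URL  = "http://localhost:8200/v1/completions"   # assumed decode pool

async def handle_request(prompt: str, max_tokens: int = 128) -> str:
    async with httpx.AsyncClient(timeout=60.0) as client:
        # 1) Prefill pass: generate (almost) nothing, just populate the KV
        #    cache on the prefill GPU; the engine then ships that cache to
        #    the decode pool via its KV-transfer mechanism.
        await client.post(PREFILL_URL, json={
            "model": "my-model",          # placeholder model name
            "prompt": prompt,
            "max_tokens": 1,
        })

        # 2) Decode pass: same prompt to the decode pool; the engine reuses
        #    the transferred KV cache instead of recomputing prefill and
        #    produces the actual completion.
        resp = await client.post(DECODE_URL, json={
            "model": "my-model",
            "prompt": prompt,
            "max_tokens": max_tokens,
        })
        return resp.json()["choices"][0]["text"]

if __name__ == "__main__":
    print(asyncio.run(handle_request("Explain KV cache transfer in one line.")))
```

A real router also load-balances across both pools and handles failures mid-transfer; that is where much of the ops complexity mentioned above lives.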
When worth it
- High-volume production: ops complexity earns its keep at scale
- Multi-GPU already: incremental complexity over existing multi-GPU is moderate
- Long-context workloads: prefill dominates; disaggregation helps most
- Mixed prefill / decode workloads: latency variance reduction matters
- Don't use for: SMB single-GPU deployments; complexity isn't earned
Verdict
Prefill / decode disaggregation is the cutting edge of LLM serving optimisation in 2026. Throughput gains are real; ops complexity is real. For SMB deployments, single-GPU vLLM with continuous batching is right. For datacenter-scale serving, disaggregation is increasingly the standard. Watch this space; the patterns will mature into broader vLLM support over the next 12 months.
Bottom line
Disaggregate at datacenter scale; for SMB the complexity isn't earned. See TP vs PP.