
Prefill / Decode Disaggregation

Splitting prefill and decode onto different GPUs is an emerging pattern for high-throughput LLM serving at scale.

For high-throughput LLM serving at scale, the emerging pattern is to disaggregate prefill (compute-bound, large parallel batches) from decode (memory-bound, small sequential steps), with different GPUs specialised for each phase. The throughput improvements are substantial; the operational complexity increases meaningfully.

TL;DR

Run prefill on compute-heavy GPUs (4090 / H100) and decode on bandwidth-heavy GPUs (5090 / H200), with the KV cache transferred between phases. Throughput improves roughly 30-50% on production workloads. Available in vLLM 0.7+ behind an experimental flag; more mature in TensorRT-LLM and SGLang. Worth it for high-volume production at multi-GPU scale.
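For reference, vLLM's experimental support works by launching separate prefill and decode instances that share the KV cache through a KV-transfer config, with the prefill side acting as producer and the decode side as consumer. The sketch below is an assumption based on the 0.7-era experimental API (connector and field names such as PyNcclConnector, kv_role, and kv_rank have shifted between releases), so treat it as a starting point and confirm against the docs for your installed version.

```python
# Hedged sketch of vLLM's experimental disaggregated prefill (0.7-era API).
# Connector / field names are assumptions and vary between releases.
from vllm import LLM
from vllm.config import KVTransferConfig

# Prefill instance: produces the KV cache and pushes it to the decode side.
prefill_llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="PyNcclConnector",   # assumed connector name
        kv_role="kv_producer",            # this instance computes and sends KV
        kv_rank=0,
        kv_parallel_size=2,
    ),
    gpu_memory_utilization=0.8,
)

# Decode instance: run in a separate process on the decode GPU with the same
# config except kv_role="kv_consumer" and kv_rank=1; it receives the KV cache
# and streams tokens.
```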

Why disaggregate

Prefill and decode have very different compute / memory characteristics:

  • Prefill: forward pass on the entire input sequence. Compute-bound. Benefits from large parallel batches.
  • Decode: forward pass on one new token at a time. Memory-bound (must read entire weights for each step). Doesn't benefit from large batches in the same way.

Running both on the same GPU means prefill bursts compete with decode for resources and latency variability rises. Splitting them lets each GPU specialise: the prefill GPU runs hot and batched, while the decode GPU runs continuous and low-latency.
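To put rough numbers on the gap, the back-of-envelope sketch below estimates arithmetic intensity (FLOPs per byte of weights read) for each phase of a 7B-parameter FP16 model. The hardware figures are ballpark assumptions for a 4090-class card, not benchmarks; the point is only that decode sits far below the card's compute-to-bandwidth ratio while batched prefill sits far above it.

```python
# Back-of-envelope: why prefill is compute-bound and decode is memory-bound.
# Hardware numbers are rough assumptions (4090-class card), not measurements.

params = 7e9                       # 7B-parameter model
weight_bytes = params * 2          # FP16: 2 bytes per parameter
flops_per_token = 2 * params       # ~2 FLOPs per parameter per generated token

# Decode: one token per step, yet the full weights are read every step
# (ignoring KV-cache reads, which make decode even more bandwidth-bound).
decode_intensity = flops_per_token / weight_bytes            # ~1 FLOP/byte

# Prefill: a 2048-token prompt is processed in one batched pass, amortising
# the same weight reads over 2048 tokens' worth of compute.
prompt_len = 2048
prefill_intensity = prompt_len * flops_per_token / weight_bytes

# Assumed 4090-class peaks: ~80 TFLOP/s FP16, ~1 TB/s memory bandwidth.
ridge_point = 80e12 / 1e12         # ~80 FLOP/byte; below this = memory-bound

print(f"decode  intensity ~ {decode_intensity:.0f} FLOP/byte")
print(f"prefill intensity ~ {prefill_intensity:.0f} FLOP/byte")
print(f"GPU ridge point   ~ {ridge_point:.0f} FLOP/byte")
```

Decode lands around 1 FLOP/byte against a ridge point of roughly 80, so the memory bus is the bottleneck; batched prefill lands in the thousands, so the compute units are.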

How it works

  1. Request lands on router
  2. Router sends to prefill GPU pool; prefill computes initial KV cache
  3. KV cache transferred (over NVLink / fast interconnect) to decode GPU pool
  4. Decode GPU streams output tokens, using the KV cache
  5. Decode GPU returns response to router; router returns to client (the request path is sketched in code below)
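Expressed as code, the request path looks roughly like the sketch below. It is purely conceptual: prefill_pool, decode_pool, prefill(), load_kv(), and decode_step() are hypothetical placeholders rather than any framework's API, and in a real deployment the KV transfer in step 3 happens over NVLink or RDMA, not a Python call.

```python
# Conceptual sketch of the disaggregated request path. All names here
# (prefill_pool, decode_pool, prefill, load_kv, decode_step) are hypothetical
# placeholders, not the API of vLLM, TensorRT-LLM, or SGLang.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    max_new_tokens: int = 256

def handle_request(req: Request, prefill_pool, decode_pool) -> str:
    # Steps 1-2: router picks a prefill GPU, which runs one batched forward
    # pass over the whole prompt and produces the initial KV cache.
    prefill_gpu = prefill_pool.pick_least_loaded()
    kv_cache, first_token = prefill_gpu.prefill(req.prompt)

    # Step 3: ship the KV cache to a decode GPU (NVLink / RDMA in practice;
    # this hop is why interconnect bandwidth matters for disaggregation).
    decode_gpu = decode_pool.pick_least_loaded()
    decode_gpu.load_kv(kv_cache)

    # Step 4: the decode GPU generates one token per step, reusing the cache.
    tokens = [first_token]
    while len(tokens) < req.max_new_tokens and tokens[-1] != "<eos>":
        tokens.append(decode_gpu.decode_step(tokens[-1]))

    # Step 5: response goes back through the router to the client.
    return "".join(tokens)
```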

When worth it

  • High-volume production: ops complexity earns its keep at scale
  • Multi-GPU already: incremental complexity over existing multi-GPU is moderate
  • Long-context workloads: prefill dominates; disaggregation helps most
  • Mixed prefill / decode workloads: latency variance reduction matters
  • Don't use for: SMB single-GPU deployments; complexity isn't earned

Verdict

Prefill / decode disaggregation is the cutting edge of LLM serving optimisation in 2026. The throughput gains are real, and so is the operational complexity. For SMB deployments, single-GPU vLLM with continuous batching remains the right choice. For datacenter-scale serving, disaggregation is increasingly the standard. Watch this space; the patterns will mature into broader vLLM support over the next 12 months.

Bottom line

Disaggregate at datacenter scale; not earned for SMB. See TP vs PP.
