
vLLM Chunked Prefill Configuration

Chunked prefill keeps decode latency stable when big prompts arrive during active serving - the right config for mixed workloads.

Without chunked prefill, a 16,000-token RAG prompt arriving mid-serving can freeze decode for active users for hundreds of milliseconds. Chunked prefill splits the long prompt across multiple forward passes so decode continues between chunks. On dedicated GPU servers this is the single most important feature for mixed chat-plus-RAG workloads.


Why It Matters

Regular prefill is atomic per request. A 16k prompt takes one big forward pass that blocks every other sequence. Chunked prefill processes 2048 or 4096 tokens at a time; between chunks, decode for other sequences continues. Time-to-first-token for the long prompt stays similar but p99 inter-token latency for other users stays low.

Configuration

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048

With chunked prefill enabled, --max-num-batched-tokens becomes the per-step token budget, which in turn caps each prefill chunk. 2048 is a good default. Drop to 1024 for stricter decode-latency requirements; raise to 4096 for higher prefill throughput at the cost of some decode jitter.
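To see why this budget caps the chunk, here is a simplified sketch of a decode-prioritised scheduling step: running sequences each decode one token first, and whatever budget remains becomes this step's prefill chunk. The function name and exact priority policy are illustrative, not vLLM's internal API:

```python
# Simplified model of splitting one engine step's token budget between
# decode and a pending prefill (decode-prioritised; names are illustrative).
def split_step_budget(max_num_batched_tokens: int,
                      running_decodes: int,
                      pending_prefill_tokens: int) -> tuple[int, int]:
    """Return (decode_tokens, prefill_chunk) for a single engine step.

    Every running sequence decodes one token; the leftover budget is
    spent prefilling the waiting prompt.
    """
    decode_tokens = min(running_decodes, max_num_batched_tokens)
    prefill_chunk = min(pending_prefill_tokens,
                        max_num_batched_tokens - decode_tokens)
    return decode_tokens, prefill_chunk

# 32 active chats plus a 16k-token prompt, with a 2048-token budget:
print(split_step_budget(2048, 32, 16_000))  # (32, 2016)
```

Note how a larger budget means bigger chunks (faster prefill) but longer individual forward passes, which is exactly the decode-jitter trade-off described above.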

Sizing the Chunk

Workload                               Chunk size
Pure chat (short prompts)              Not needed
Mixed chat + occasional long prompts   2048
RAG with long context                  2048-4096
Document processing only               8192+ (decode latency not a goal)

Interactions

Chunked prefill interacts with a few other features:

  • Prefix caching: fully compatible; the two combined are ideal.
  • Speculative decoding: works but performance gains narrow during prefill chunks.
  • Multi-LoRA (vLLM LoRA support): compatible.
  • Tensor parallel: compatible; each chunk crosses the interconnect as usual.


See also: continuous batching and prefix caching.
