Without chunked prefill, a 16,000-token RAG prompt arriving mid-serving can freeze decode for active users for hundreds of milliseconds. Chunked prefill splits the long prompt across multiple forward passes so decode continues between chunks. On dedicated GPU servers this is the single most important feature for mixed chat-plus-RAG workloads.
Why It Matters
Regular prefill is atomic per request. A 16k prompt takes one big forward pass that blocks every other sequence. Chunked prefill processes 2048 or 4096 tokens at a time; between chunks, decode for other sequences continues. Time-to-first-token for the long prompt stays similar but p99 inter-token latency for other users stays low.
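The interleaving above can be sketched with a toy scheduler model (illustrative only, not vLLM's actual scheduling logic; the function name and structure are made up for this example):

```python
def schedule(prompt_len, chunk_size):
    """Toy model: the sequence of scheduler steps when a long prompt is
    chunked. Each prefill chunk is followed by a decode step in which
    other running sequences each advance one token."""
    steps = []
    for start in range(0, prompt_len, chunk_size):
        steps.append(("prefill", min(chunk_size, prompt_len - start)))
        steps.append(("decode", 1))  # other sequences keep generating
    return steps

# A 16,000-token prompt with 2048-token chunks: decode gets to run 8
# times instead of stalling once for the entire prompt.
steps = schedule(16_000, 2048)
print(sum(1 for kind, _ in steps if kind == "prefill"))  # → 8
```

Without chunking, the same model would emit a single `("prefill", 16_000)` step and every other sequence would wait it out.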
Configuration
```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048
```
With chunked prefill enabled, --max-num-batched-tokens becomes the per-step token budget, which effectively caps the prefill chunk size. 2048 is a good default; drop to 1024 for stricter decode-latency targets, or raise to 4096 for higher prefill throughput at the cost of some decode jitter.
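The same settings carry over to vLLM's offline Python API. A minimal sketch (the keyword arguments mirror the server flags above; this needs a GPU and the model weights, so treat it as a configuration fragment rather than a runnable demo):

```python
from vllm import LLM, SamplingParams

# Same knobs as the server flags: enable_chunked_prefill turns the
# feature on, max_num_batched_tokens sets the per-step token budget.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,
)

outputs = llm.generate(
    ["Summarize the attached report."],
    SamplingParams(max_tokens=128),
)
```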
Sizing the Chunk
| Workload | Chunk Size |
|---|---|
| Pure chat (short prompts) | Not needed |
| Mixed chat + occasional long prompts | 2048 |
| RAG with long context | 2048-4096 |
| Document processing only | 8192+ (decode latency not a goal) |
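A quick way to reason about the table above is to compute, for a given chunk size, how many forward passes a prompt costs and how long decode can stall between them. A rough sizing helper (the throughput figure of 40 prefill tokens/ms is an assumed, illustrative number; measure your own hardware):

```python
import math

def chunk_stats(prompt_len, chunk_size, prefill_tokens_per_ms=40):
    """Number of prefill passes for a prompt, and the worst single
    stall decode sees between chunks, under an assumed prefill rate."""
    n_chunks = math.ceil(prompt_len / chunk_size)
    worst_stall_ms = chunk_size / prefill_tokens_per_ms
    return n_chunks, worst_stall_ms

# 16k RAG prompt with 2048-token chunks: 8 passes, ~51 ms max stall
# per chunk, versus one ~400 ms stall with unchunked prefill.
print(chunk_stats(16_000, 2048))  # → (8, 51.2)
```

Halving the chunk size halves the worst-case stall but doubles the number of passes, which is exactly the throughput/latency trade-off the table encodes.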
Interactions
Chunked prefill interacts with a few other features:
- Prefix caching: fully compatible; the two combined are ideal.
- Speculative decoding: works, but speculative gains shrink while prefill chunks occupy the batch.
- Multi-LoRA (vLLM LoRA support): compatible.
- Tensor parallel: compatible; each chunk crosses the interconnect as usual.
Low-Latency LLM Serving
We tune chunked prefill and batching for your specific mix of short and long prompts.
Browse GPU Servers. See continuous batching and prefix caching.