Chunked prefill is a vLLM scheduler feature that mixes prefill and decode work inside a single forward pass. On the RTX 5060 Ti 16GB at our dedicated GPU hosting, it dramatically reduces tail latency for concurrent chat when one user pastes a long prompt.
The Problem
Without chunked prefill, vLLM alternates between prefill batches and decode batches. If user A sends a 32k-token prompt while users B-E are mid-stream decoding, the prefill blocks the decode batch for hundreds of milliseconds. Every active user sees a stall.
On a small GPU like the 5060 Ti the pain is doubled: prefill is already slow because compute is the bottleneck, so these stalls are clearly visible to every user.
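A back-of-envelope estimate makes the stall concrete. The throughput figure below is an assumption for illustration, not a measured number for this card:

```python
# Rough stall estimate: how long a monolithic prefill blocks all decoding.
# PREFILL_TOKS_PER_S is an assumed figure for a compute-bound 16 GB card.
PREFILL_TOKS_PER_S = 20_000
PROMPT_TOKENS = 32_000

stall_ms = PROMPT_TOKENS / PREFILL_TOKS_PER_S * 1000
print(f"decode stall without chunking: {stall_ms:.0f} ms")  # 1600 ms
```

At anything like this speed, every streaming user freezes for over a second whenever one long prompt lands.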
How It Works
Chunked prefill splits prefill into fixed-size chunks (default 512 tokens). Each scheduler step processes one prefill chunk plus any decode steps that fit in the budget. Decode throughput stays continuous; prefill completes over multiple forward passes instead of one giant one.
Trade-off: total prefill time for the long request is slightly higher (more scheduler overhead), but p99 decode latency for other users drops sharply.
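The scheduling logic can be sketched as a toy model. This is an illustration of the budget-splitting idea, not vLLM's actual scheduler code:

```python
# Toy model of one chunked-prefill scheduler step (a sketch, not vLLM's code).
BUDGET = 2048   # --max-num-batched-tokens
CHUNK = 512     # prefill chunk size

def schedule_step(prefill_remaining: int, n_decoding: int):
    """Decodes are admitted first (1 token each); prefill gets the leftover budget."""
    decode_tokens = min(n_decoding, BUDGET)
    prefill_tokens = min(prefill_remaining, CHUNK, BUDGET - decode_tokens)
    return prefill_tokens, decode_tokens

# A 32k-token prompt arrives while 8 users are mid-decode:
remaining, steps = 32_000, 0
while remaining:
    p, _ = schedule_step(remaining, 8)
    remaining -= p
    steps += 1
print(steps)  # the prefill is spread over 63 forward passes
```

Because the 8 decode tokens are reserved in every step, streaming never stalls; the long prompt simply takes 63 small passes instead of one huge one.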
Enabling
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 \
--enable-chunked-prefill \
--max-num-batched-tokens 2048 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90
Key knobs:
- --max-num-batched-tokens: total token budget per forward pass. 2048 is a good default for 16 GB; drop to 1024 if VRAM is tight.
- Chunked prefill is on by default in vLLM 0.6+ when max_model_len is high; the explicit flag just guarantees it.
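The budget knob directly sets how much decode capacity survives alongside a prefill chunk. A quick illustration, assuming the default 512-token chunk:

```python
# Decode capacity left per step alongside one 512-token prefill chunk
# (illustrative arithmetic; one token per in-flight decode sequence).
CHUNK = 512
slots = {budget: budget - CHUNK for budget in (2048, 1024)}
for budget, decode_slots in slots.items():
    print(f"budget {budget}: {decode_slots} decode tokens left per step")
```

Halving the budget to 1024 still leaves 512 decode slots per step, far more than typical concurrency on a single 16 GB card needs.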
Measured Impact
8 concurrent users doing 2,000-token chat; one user periodically sends a 16,000-token prompt. Llama 3.1 8B FP8 on 5060 Ti 16GB:
| Metric | No chunked prefill | With chunked prefill | Delta |
|---|---|---|---|
| p50 TTFT (short prompts) | 180 ms | 200 ms | +11% |
| p99 TTFT (short prompts) | 4,200 ms | 380 ms | -91% |
| p50 decode latency | 12 ms | 14 ms | +17% |
| p99 decode latency | 980 ms | 45 ms | -95% |
| Aggregate throughput | 360 t/s | 390 t/s | +8% |
| Long prompt full prefill | 1,400 ms | 1,650 ms | +18% |
Short-prompt users see a vastly smoother experience; the long-prompt user pays a small prefill tax. It is a net win unless you are serving a single batch-1 user with giant prompts, in which case disable it.
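For readers reproducing the table, this is the percentile methodology in miniature. The latency samples below are made up to show the mechanics; they are not our measured data:

```python
# Nearest-rank percentile computation (methodology sketch; samples are
# illustrative only). Two stall outliers dominate p99 but not p50.
import math
import statistics

decode_ms = [12] * 98 + [500, 980]
p50 = statistics.median(decode_ms)
idx = math.ceil(0.99 * len(decode_ms)) - 1   # nearest-rank 99th percentile
p99 = sorted(decode_ms)[idx]
print(f"p50={p50} ms  p99={p99} ms")
```

This is exactly why the table's p50 barely moves while p99 collapses: chunked prefill removes the outliers, not the typical step.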
Interactions
- With prefix caching: complementary – cached blocks skip prefill entirely, chunked prefill smooths the non-cached remainder.
- With speculative decoding: chunked prefill takes precedence; speculative work is deferred until the decode batch settles.
- With FP8 KV cache: independent – free to stack both.
- With long context (128k): essential – a 64k prefill without chunking freezes the server.
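Since the FP8 KV cache stacks freely with chunked prefill, a combined launch looks like the following. Flag names are as of vLLM 0.6.x; verify against your version's --help:

```shell
# Chunked prefill + FP8 KV cache together (independent features, both on).
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```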
Recommendation: enable chunked prefill on any vLLM deployment serving more than one concurrent user. The p99 improvement is worth the small p50 regression on short prompts.
Smooth Concurrent LLM Serving
Chunked prefill eliminates tail-latency spikes from long prompts. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: context budget, FP8 Llama deployment, chatbot hosting.