Prefill is the phase where the model reads the prompt before generating the first token. It is compute-bound and usually the TTFT bottleneck. The numbers below were measured on the RTX 5060 Ti 16GB in our hosting environment:
## Setup

- vLLM 0.6.4 with `max_tokens=1` to isolate prefill
- Metric: input tokens per second
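As a rough illustration, a prefill-only measurement of this kind might look like the sketch below. The model path, prompt construction, and warm-up step are our assumptions, not the exact harness used for the tables; the key idea is that `max_tokens=1` forces a full prefill followed by a single decode step.

```python
import time
from vllm import LLM, SamplingParams

# Model path is illustrative; the benchmark used FP8 weights.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# max_tokens=1: the engine processes the whole prompt, emits one token,
# so wall-clock time is dominated by prefill.
params = SamplingParams(max_tokens=1)

# Warm-up run so graph capture / allocator overhead isn't timed.
llm.generate(["warmup"], params)

prompt = "word " * 2048  # roughly 2k tokens; exact count is tokenizer-dependent

start = time.perf_counter()
out = llm.generate([prompt], params)
elapsed = time.perf_counter() - start

n_input = len(out[0].prompt_token_ids)
print(f"{n_input} input tokens in {elapsed * 1000:.0f} ms "
      f"-> {n_input / elapsed:,.0f} input t/s")
```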
## By Model
| Model | Precision | Prefill t/s |
|---|---|---|
| Phi-3-mini | FP8 | 14,000 |
| Llama 3.2 3B | FP8 | 11,500 |
| Mistral 7B | FP8 | 7,200 |
| Llama 3.1 8B | FP8 | 6,800 |
| Gemma 2 9B | FP8 | 5,400 |
| Qwen 2.5 14B | AWQ INT4 | 2,100 |
## By Prompt Length (Llama 3.1 8B FP8)
| Prompt | Prefill time | TTFT impact |
|---|---|---|
| 128 tok | 19 ms | +19 ms |
| 512 tok | 75 ms | +75 ms |
| 2,048 tok | 301 ms | +301 ms |
| 8,192 tok | 1,205 ms | +1,205 ms |
| 32,768 tok | 4,820 ms | +4,820 ms |
Prefill time scales nearly linearly with prompt length across the whole tested range: each 4x increase in tokens costs ~4x the time, so throughput holds near 6,800 input t/s even at 32k. Attention cost grows quadratically with length, but at these prompt sizes the linear GEMM work still dominates on this card.
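Since throughput is roughly constant, prefill latency on this card reduces to a one-line rule of thumb. The helper below is our own convenience function, using the ~6,800 t/s figure from the table as an approximation:

```python
def prefill_ms(prompt_tokens: int, tps: float = 6800) -> float:
    """Estimated prefill time in ms for Llama 3.1 8B FP8 on the RTX 5060 Ti."""
    return prompt_tokens / tps * 1000

# e.g. a 4,096-token RAG prompt adds ~602 ms to TTFT before decode starts
print(f"{prefill_ms(4096):.0f} ms")
```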
## Implications
- For short prompts (<1k tokens): prefill is negligible; decode dominates TTFT
- For long prompts (8k+): prefill dominates TTFT; enable prefix caching or chunked prefill (see the sketch after this list)
- RAG: retrieved passages are usually 2-4k tokens, so prefill adds roughly 300-600 ms per query
- FP8 vs INT4: FP8 prefill is 2-3x faster because it runs natively on Blackwell's FP8 tensor cores at peak GEMM throughput, while AWQ INT4 weights must be dequantized before each matmul
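For the long-prompt case above, vLLM exposes both mitigations as engine arguments. A minimal sketch, assuming vLLM 0.6.x flag names and an illustrative model path:

```python
from vllm import LLM

# Prefix caching reuses KV-cache entries for shared prompt prefixes
# (system prompts, repeated RAG boilerplate); chunked prefill splits
# long prompts into chunks interleaved with decode steps, so a single
# 32k prompt doesn't stall other in-flight requests.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
)
```

When serving over HTTP, the equivalent CLI flags are `--enable-prefix-caching` and `--enable-chunked-prefill` on `vllm serve`.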
## Prefill-Optimised LLM Hosting

6,800 input t/s on Llama 3.1 8B FP8. UK dedicated hosting.

Order the RTX 5060 Ti 16GB

See also: decode benchmark, TTFT p99, long-context perf, prefix caching.