
vLLM High Latency: Reducing Time to First Token

Reduce vLLM time to first token (TTFT) and inter-token latency. Covers prefill optimization, batch scheduling, model warm-up, and infrastructure tuning for responsive LLM serving.

The Latency Problem

Your vLLM server produces correct responses, but the time to first token (TTFT) is unacceptable. Users stare at a blank screen for two, five, or even ten seconds before text starts streaming. For interactive applications, TTFT above one second feels sluggish, and above three seconds feels broken.

TTFT in vLLM is the time from receiving the request to generating the first output token. It is dominated by the prefill phase — where the model processes the entire input prompt. Longer prompts mean longer TTFT. Higher server load compounds the problem.

Measuring Your Current Latency

# Measure TTFT for a single request
import time
import requests

start = time.time()
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
          "prompt": "Explain quantum computing in simple terms.",
          "max_tokens": 100, "stream": True},
    stream=True
)
for chunk in response.iter_lines():
    if chunk:
        ttft = time.time() - start
        print(f"TTFT: {ttft:.3f}s")
        break

Run this at different concurrency levels to understand how TTFT degrades under load on your GPU server.
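The single-request script above can be extended to fire batches of concurrent requests. A sketch, assuming the same local server and model as before (URL and MODEL are placeholders; point them at your deployment):

```python
# Sketch: TTFT under load. URL/MODEL are assumptions -- adjust to your server.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"
MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def measure_ttft(prompt: str) -> float:
    """Seconds from sending the request to the first streamed chunk."""
    start = time.time()
    resp = requests.post(
        URL,
        json={"model": MODEL, "prompt": prompt,
              "max_tokens": 16, "stream": True},
        stream=True,
    )
    for chunk in resp.iter_lines():
        if chunk:                               # first non-empty SSE line
            return time.time() - start
    return float("inf")

def median(xs: list[float]) -> float:
    """Middle value of the sorted samples."""
    s = sorted(xs)
    return s[len(s) // 2]

if __name__ == "__main__":
    for n in (1, 4, 16):
        with ThreadPoolExecutor(max_workers=n) as pool:
            ttfts = list(pool.map(measure_ttft, ["Explain RAID levels."] * n))
        print(f"concurrency={n:2d}  median TTFT={median(ttfts):.3f}s")
```

If median TTFT at concurrency 16 is several times the single-request value, prefills are queueing behind each other, which is exactly what the scheduling options in this guide target.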

Optimization 1: Reduce Effective Prompt Length

Prefill time is roughly proportional to prompt token count. Shorter prompts mean faster TTFT:

  • Trim system prompts to essentials. A 2000-token system prompt adds significant prefill overhead to every request.
  • Use concise conversation history. Summarise earlier turns instead of including full transcripts.
  • Set --max-model-len to the minimum your application needs — this prevents accidentally processing very long inputs.
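To make the history-trimming point concrete, here is a minimal sketch of a budget-aware prompt builder. The 4-characters-per-token ratio and the helper names are illustrative assumptions; use the model's own tokenizer (e.g. transformers.AutoTokenizer) for exact counts:

```python
# Sketch: keep prompts inside a token budget. The 4-chars-per-token ratio and
# these helper names are illustrative assumptions, not vLLM APIs.
def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def build_prompt(system: str, history: list[str], user: str,
                 budget_tokens: int = 2048) -> str:
    """Keep only the newest history turns that fit the token budget."""
    used = estimate_tokens(system) + estimate_tokens(user)
    kept: list[str] = []
    for turn in reversed(history):      # walk newest turn first
        cost = estimate_tokens(turn)
        if used + cost > budget_tokens:
            break                       # older turns get dropped
        kept.append(turn)
        used += cost
    return "\n".join([system, *reversed(kept), user])
```

Dropping (or summarising) the oldest turns first keeps recent context intact while bounding prefill work per request.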

Optimization 2: Enable Prefix Caching

--enable-prefix-caching

If requests share a common system prompt, prefix caching computes the KV values for that prompt once and reuses them. The repeated prefix adds almost no prefill time after the first request; only the tokens after the shared prefix are prefilled. For chatbot applications with a 500-token system prompt, this can cut TTFT by 40-60 percent.
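The cache only hits when the leading tokens of a new request exactly match a previous one, so keep the shared text first and byte-identical. A small illustration (SYSTEM and make_prompt are hypothetical names):

```python
# Sketch: structure prompts so the shared part comes first and is
# byte-identical across requests; SYSTEM and make_prompt are hypothetical.
SYSTEM = "You are a helpful assistant. Answer concisely."

def make_prompt(user_msg: str) -> str:
    """Shared system text first, per-request content last."""
    return f"{SYSTEM}\n\nUser: {user_msg}\nAssistant:"

# Both prompts start with the same characters, so after the first request the
# cached prefix KV blocks are reused and only the suffix is prefilled.
assert make_prompt("What is RAID?")[: len(SYSTEM)] == \
       make_prompt("What is ZFS?")[: len(SYSTEM)]
```

Anything per-request, such as a timestamp or user ID, belongs after the shared prefix; putting it first changes the leading tokens and defeats the cache for every request.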

Optimization 3: Configure Chunked Prefill

--enable-chunked-prefill \
--max-num-batched-tokens 2048

Chunked prefill breaks long prompt processing into smaller chunks, interleaving them with decode steps from other requests. This prevents a single long-prompt request from blocking all other requests: short requests see better TTFT and tail latency, while the long-prompt request itself pays a small TTFT penalty for the interleaving.
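Putting the scheduler flags together, a server launch might look like the following sketch (combine with prefix caching from the previous section, and tune the token budget to your GPU — smaller values favour TTFT fairness, larger values favour throughput):

```shell
# Example launch combining the scheduling flags discussed above.
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048 \
  --enable-prefix-caching
```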

Optimization 4: Model Warm-Up

The first request after startup always has higher TTFT because CUDA kernels need to be compiled and caches must be populated:

# Send a warm-up request after server start
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Meta-Llama-3.1-8B-Instruct","prompt":"warmup","max_tokens":1}' \
  > /dev/null

Include this in your startup script or systemd service as a post-start hook. Our vLLM production guide includes a complete systemd unit with warm-up.
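Where the startup script is Python rather than shell, the same warm-up can be done with a readiness poll first. vLLM's OpenAI-compatible server exposes a /health endpoint; BASE, MODEL, and the timeout below are assumptions to adjust:

```python
# Sketch: poll /health until the server is up, then send one warm-up request.
# BASE and MODEL are assumptions -- adjust to your deployment.
import json
import time
import urllib.error
import urllib.request

BASE = "http://localhost:8000"
MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def warmup_payload(model: str) -> dict:
    """One-token completion: enough to trigger kernel compilation and caches."""
    return {"model": model, "prompt": "warmup", "max_tokens": 1}

def wait_and_warm(timeout_s: float = 120.0) -> None:
    """Block until /health answers 200, then fire the warm-up request."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{BASE}/health", timeout=2) as r:
                if r.status == 200:
                    break
        except (urllib.error.URLError, OSError):
            time.sleep(1)
    req = urllib.request.Request(
        f"{BASE}/v1/completions",
        data=json.dumps(warmup_payload(MODEL)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=120).read()

if __name__ == "__main__":
    wait_and_warm()
```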

Optimization 5: Hardware-Level Improvements

TTFT is ultimately limited by the GPU’s compute throughput during the prefill phase:

  • RTX 6000 Pro vs RTX 5090: The RTX 6000 Pro has higher memory bandwidth (2 TB/s vs 1 TB/s). Long-prompt prefill is mostly compute-bound, but the extra bandwidth directly speeds up the memory-bound parts of serving: short-prompt prefill, attention over large KV caches, and decode.
  • Tensor parallelism: Splitting the model across 2 GPUs roughly halves prefill time, minus inter-GPU communication overhead: --tensor-parallel-size 2
  • PCIe bandwidth: Gen5 reduces inter-GPU communication latency compared to Gen4 during tensor-parallel prefill.

If your current GPU server does not meet latency requirements after software optimization, hardware is the remaining lever. Check our benchmarks section for GPU performance comparisons.

Optimization 6: Reduce Network Overhead

Time measured at the client includes network round-trip. Ensure your application server is on the same network as the GPU server, or minimise hops:

  • Run the application on the same machine as vLLM when possible.
  • Use Unix sockets instead of TCP for local connections.
  • For Nginx proxied setups, ensure proxy buffering is disabled for streaming: proxy_buffering off;

Our API infrastructure guide covers reverse proxy configuration for low-latency streaming.
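As a sketch, a streaming-friendly Nginx location block might look like this (upstream address, path, and timeout values are assumptions for your setup):

```nginx
# Sketch: proxy streaming completions without buffering SSE chunks.
location /v1/ {
    proxy_pass http://127.0.0.1:8000;
    proxy_http_version 1.1;
    proxy_buffering off;          # flush each SSE chunk immediately
    proxy_cache off;
    proxy_read_timeout 300s;      # long generations keep the stream open
    proxy_set_header Connection "";
}
```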

Verification: Before and After

# Benchmark with vLLM's built-in tool
python -m vllm.benchmarks.benchmark_serving \
  --backend openai \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --num-prompts 200 \
  --request-rate 20 \
  --input-len 512 \
  --output-len 128

Compare the TTFT percentiles (P50, P90, P99) before and after each optimization. Monitor over time with your GPU monitoring setup. For the full memory and throughput optimization picture, see our vLLM optimization guide.
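If you log raw TTFT samples yourself (for example from the measurement script earlier), the same percentiles can be computed directly. A nearest-rank sketch:

```python
# Sketch: nearest-rank percentiles over logged TTFT samples.
def percentile(samples: list[float], p: float) -> float:
    """Return the p-th percentile (0-100) using nearest-rank on sorted data."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

ttfts = [0.21, 0.25, 0.24, 0.31, 0.90, 0.23, 0.27, 0.26, 0.29, 1.40]
for p in (50, 90, 99):
    print(f"P{p}: {percentile(ttfts, p):.2f}s")
```

P99 is the number to watch for interactive workloads: a healthy P50 can hide a long queueing tail that users at peak load still feel.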

Low-Latency GPU Servers

GigaGPU offers RTX 6000 Pro and high-bandwidth GPU servers designed for interactive AI inference with minimal latency.

Browse GPU Servers


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
